Abstract This article describes the development and formal verification (proof of
semantic preservation) of a compiler back-end from Cminor (a simple imperative inter- 
mediate language) to PowerPC assembly code, using the Coq proof assistant both for 
programming the compiler and for proving its soundness. Such a verified compiler is 
useful in the context of formal methods applied to the certification of critical software: 
the verification of the compiler guarantees that the safety properties proved on the 
source code hold for the executable compiled code as well. 
1 Introduction

Can you trust your compiler? Compilers are generally assumed to be semantically trans- 
parent: the compiled code should behave as prescribed by the semantics of the source 
program. Yet, compilers — and especially optimizing compilers — are complex programs 
that perform complicated symbolic transformations. Despite intensive testing, bugs in 
compilers do occur, causing the compiler to crash at compile time or — much worse — to 
silently generate an incorrect executable for a correct source program [67,65,31]. 

For low-assurance software, validated only by testing, the impact of compiler bugs 
is low: what is tested is the executable code produced by the compiler; rigorous testing 
should expose compiler-introduced errors along with errors already present in the source 
program. Note, however, that compiler-introduced bugs are notoriously difficult to 
track down. Moreover, test plans need to be made more complex if optimizations are 
to be tested: for example, loop unrolling introduces additional limit conditions that are 
not apparent in the source loop. 

The picture changes dramatically for safety-critical, high-assurance software. Here, 
validation by testing reaches its limits and needs to be complemented or even replaced 
by the use of formal methods: model checking, static analysis, program proof, etc.



X. Leroy 

INRIA Paris-Rocquencourt, B.P. 105, 78153 Le Chesnay, France 
E-mail: Xavier.Leroy@inria.fr






Almost universally, formal methods are applied to the source code of a program. Bugs 
in the compiler that is used to turn this formally verified source code into an executable 
can potentially invalidate all the guarantees so painfully obtained by the use of formal 
methods. In a future where formal methods are routinely applied to source programs,
the compiler could appear as a weak link in the chain that goes from specifications to 
executables. 

The safety-critical software industry is aware of these issues and uses a variety of 
techniques to alleviate them: even more testing (of the compiler and of the generated 
executable); turning compiler optimizations off; and in extreme cases, conducting man- 
ual code reviews of the generated assembly code. These techniques do not fully address 
the issue and are costly in terms of development time and program performance. 

An obviously better approach is to apply formal methods to the compiler itself in 
order to gain assurance that it preserves the semantics of the source programs. Many 
different approaches have been proposed and investigated, including on-paper and on- 
machine proofs of semantic preservation, proof-carrying code, credible compilation, 
translation validation, and type-preserving compilers. (These approaches are compared 
in section 2.) 

For the last four years, we have been working on the development of a realistic, 
verified compiler called Compcert. By verified, we mean a compiler that is accompanied 
by a machine-checked proof that the generated code behaves exactly as prescribed by 
the semantics of the source program (semantic preservation property). By realistic, 
we mean a compiler that could realistically be used in the context of production of 
critical software. Namely, it compiles a language commonly used for critical embedded 
software: not Java, not ML, not assembly code, but a large subset of the C language. 
It produces code for a processor commonly used in embedded systems, as opposed e.g. 
to a virtual machine: we chose the PowerPC because it is popular in avionics. Finally,
the compiler must generate code that is efficient enough and compact enough to fit 
the requirements of critical embedded systems. This implies a multi-pass compiler that 
features good register allocation and some basic optimizations. 

This paper reports on the completion of a large part of this program: the formal 
verification of a lightly-optimizing compiler back-end that generates PowerPC assembly 
code from a simple imperative intermediate language called Cminor. This verification 
is mechanized using the Coq proof assistant [25,11]. Another part of this program — 
the verification of a compiler front-end translating a subset of C called Clight down to 
Cminor — has also been completed and is described separately [15,16]. 

While there exists a considerable body of earlier work on machine-checked correct- 
ness proofs of parts of compilers (see section 18 for a review), our work is novel in 
two ways. First, published work tends to focus on a few parts of a compiler, such as 
optimizations and the underlying static analyses [55,19] or translation of a high-level 
language to virtual machine code [49]. In contrast, our work emphasizes end-to-end 
verification of a complete compilation chain from a structured imperative language 
down to assembly code through 6 intermediate languages. We found that many of 
the non-optimizing translations performed, while often considered obvious in compiler 
literature, are surprisingly tricky to prove correct formally. 

Another novelty of this work is that most of the compiler is written directly in the 
Coq specification language, in a purely functional style. The executable compiler is 
obtained by automatic extraction of Caml code from this specification. This approach 
is an attractive alternative to writing the compiler in a conventional programming 
language, then using a program logic to relate it with its specifications. This approach
has never been applied before to a program of the size and complexity of an optimizing
compiler.

Fig. 1 The passes and intermediate languages of Compcert.

The complete source code of the Coq development, extensively commented, is available
on the Web [58]. We take advantage of this availability to omit proofs and a number
of low-level details from this article, referring the interested reader to the Coq
development instead. The purpose of this article is to give a high-level presentation of a
verified back-end, with just enough details to enable readers to apply similar tech- 
niques in other contexts. The general perspective we adopt is to revisit classic compiler 
technology from the viewpoint of the semanticist, in particular by distinguishing clearly 
between the correctness-relevant and the performance-relevant aspects of compilation 
algorithms, which are inextricably mixed in compiler literature. 

The remainder of this article is organized as follows. Section 2 formalizes various 
approaches to establishing trust in the results of compilation. Section 3 presents the 
main aspects of the development that are shared between all passes of the compiler: the 
value and memory models, labeled transition semantics, proofs by simulation diagrams. 
Sections 4 and 13 define the semantics of our source language Cminor and our target 
language PPC, respectively. The bulk of this article (sections 5 to 14) is devoted to 
the description of the successive passes of the compiler, the intermediate languages 
they operate on, and their soundness proofs. (Figure 1 summarizes the passes and 
the intermediate languages.) Experimental data on the Coq development and on the 
executable compiler extracted from it are presented in sections 15 and 16. Section 17 
discusses some of the design choices and possible extensions. Related work is discussed 
in section 18, followed by concluding remarks in section 19. 



2 General framework 

2.1 Notions of semantic preservation 

Consider a source program S and a compiled program C produced by a compiler. Our 
aim is to prove that the semantics of S was preserved during compilation. To make 
this notion of semantic preservation precise, we assume given semantics for the source
language $L_s$ and the target language $L_t$. These semantics associate one or several
observable behaviors B to S and C. Typically, observable behaviors include termination,
divergence, and "going wrong" on executing an undefined computation. (In the remainder
of this work, behaviors also contain traces of input-output operations performed
during program execution.) We write $S \Downarrow B$ to mean that program S executes with
observable behavior B, and likewise for C.

The strongest notion of semantic preservation during compilation is that the 
source program S and the compiled code C have exactly the same sets of observable 
behaviors — a standard bisimulation property: 

Definition 1 (Bisimulation) $\forall B,\ S \Downarrow B \iff C \Downarrow B$.

Definition 1 is too strong to be usable as our notion of semantic preservation. If the 
source language is not deterministic, compilers are allowed to select one of the possible 
behaviors of the source program. (For instance, C compilers choose one particular eval- 
uation order for expressions among the several orders allowed by the C specifications.) 
In this case, C will have fewer behaviors than S. To account for this degree of freedom, 
we can consider a backward simulation, or refinement, property: 

Definition 2 (Backward simulation) $\forall B,\ C \Downarrow B \implies S \Downarrow B$.

Definitions 1 and 2 imply that if S always goes wrong, so does C. Several desirable 
optimizations violate this requirement. For instance, if S contains an integer division 
whose result is unused, and this division can cause S to go wrong because its second 
argument is zero, dead code elimination will result in a compiled program C that does 
not go wrong on this division. To leave more flexibility to the compiler, we can therefore
restrict the backward simulation property to safe source programs. A program S is safe,
written Safe(S), if none of its possible behaviors is in the set Wrong of "going wrong"
behaviors ($S \Downarrow B \implies B \notin \mathit{Wrong}$).

Definition 3 (Backward simulation for safe programs) If Safe(S), then
$\forall B,\ C \Downarrow B \implies S \Downarrow B$.

In other words, if S cannot go wrong (a fact that can be established by formal 
verification or static analysis of S), then neither does C; moreover, all observable 
behaviors of C are acceptable behaviors of S. 

An alternative to backward simulation (definitions 2 and 3) is forward simulation 
properties, showing that all possible behaviors of the source program are also possible 
behaviors of the compiled program: 

Definition 4 (Forward simulation) $\forall B,\ S \Downarrow B \implies C \Downarrow B$.

Definition 5 (Forward simulation for safe programs)
$\forall B \notin \mathit{Wrong},\ S \Downarrow B \implies C \Downarrow B$.

In general, forward simulations are easier to prove than backward simulations (by 
structural induction on an execution of S), but less informative: even if forward simu- 
lation holds, the compiled code C could have additional, undesirable behaviors beyond 
those of S. However, this cannot happen if C is deterministic, that is, if it admits only 
one observable behavior ($C \Downarrow B_1 \land C \Downarrow B_2 \implies B_1 = B_2$). This is the case if the target
language $L_t$ has no internal non-determinism (programs change their behaviors only in
response to different inputs but not because of internal choices) and the execution
environment is deterministic (inputs given to programs are uniquely determined by their
previous outputs).^1 In this case, it is easy to show that "forward simulation" implies
"backward simulation", and "forward simulation for safe programs" implies "backward
simulation for safe programs". The reverse implications hold if the source program is
deterministic. Figure 2 summarizes the logical implications between the various notions
of semantic preservation.

Fig. 2 Various semantic preservation properties and their relationships. An arrow from A to
B means that A logically implies B.
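The following Python sketch is only an illustration and is not part of the Coq development. It models a program by its finite set of observable behaviors, states definitions 2 to 5 as set inclusions, and checks exhaustively over a three-behavior universe that the forward properties imply the backward ones when the compiled code is deterministic. The behavior names, the function names and the assumption that every program has at least one behavior are choices made for this sketch.

```python
from itertools import combinations

# Behaviors are plain strings here; WRONG collects the "going wrong" ones.
# Every name below is illustrative and is not taken from the Coq development.
WRONG = frozenset({"wrong"})


def backward(s, c):
    """Definition 2: every behavior of C is a behavior of S."""
    return c <= s


def forward(s, c):
    """Definition 4: every behavior of S is a behavior of C."""
    return s <= c


def safe(p):
    """Safe(P): none of the behaviors of P is a going-wrong behavior."""
    return p.isdisjoint(WRONG)


def backward_safe(s, c):
    """Definition 3: backward simulation, demanded only of safe sources."""
    return (not safe(s)) or backward(s, c)


def forward_safe(s, c):
    """Definition 5: every non-wrong behavior of S is a behavior of C."""
    return (s - WRONG) <= c


def deterministic(p):
    """At most one observable behavior."""
    return len(p) <= 1


def nonempty_subsets(universe):
    """All non-empty subsets of a finite universe, as frozensets."""
    items = sorted(universe)
    return [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


# Assumption of this sketch: every program has at least one behavior.
PROGRAMS = nonempty_subsets({"terminate", "diverge", "wrong"})


def implication_holds(premise, conclusion):
    """Check premise(S, C) => conclusion(S, C) for all S and deterministic C."""
    return all(conclusion(s, c)
               for s in PROGRAMS for c in PROGRAMS
               if deterministic(c) and premise(s, c))
```

The dead-code example given earlier corresponds to `backward(frozenset({"wrong"}), frozenset({"terminate"}))`, which is false, whereas `backward_safe` holds for the same pair because the source is not safe.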

From a formal methods perspective, what we are really interested in is whether the 
compiled code satisfies the functional specifications of the application. Assume that 
such a specification is given as a predicate Spec(B) of the observable behavior. Further
assume that the specification rules out "going wrong" behaviors: $\mathit{Spec}(B) \implies B \notin \mathit{Wrong}$.
We say that C satisfies the specification, and write $C \models \mathit{Spec}$, if all behaviors
of C satisfy Spec ($\forall B,\ C \Downarrow B \implies \mathit{Spec}(B)$). The expected soundness property of the
compiler is that it preserves the fact that the source code S satisfies the specification, 
a fact that has been established separately by formal verification of S. 

Definition 6 (Preservation of a specification) $S \models \mathit{Spec} \implies C \models \mathit{Spec}$.

It is easy to show that "backward simulation for safe programs" implies "preser- 
vation of a specification" for all specifications Spec. In general, the latter property is 
weaker than the former property. For instance, if the specification of the application is 
"print a prime number", and S prints 7, and C prints 11, the specification is preserved 
but backward simulation does not hold. Therefore, definition 6 leaves more liberty 
for compiler optimizations that do not preserve semantics in general, but are correct 
for specific programs. However, it has the marked disadvantage of depending on the 
specifications of the application, so that changes in the latter can require the proof of 
preservation to be redone. 
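The prime-number example can be replayed concretely. The sketch below is an illustration in Python, not part of the Coq development. It assumes that a behavior is simply the integer printed by the program, and all names are hypothetical.

```python
def is_prime(n):
    """Trial division; sufficient for the small values used here."""
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))


def spec(behavior):
    """Spec(B): the program printed a prime number."""
    return is_prime(behavior)


def satisfies(program, predicate):
    """P |= Spec: all behaviors of P satisfy the predicate."""
    return all(predicate(b) for b in program)


S = {7}    # the source program prints 7
C = {11}   # the compiled code prints 11

# Definition 6: S |= Spec implies C |= Spec.
spec_preserved = (not satisfies(S, spec)) or satisfies(C, spec)

# Definition 2: every behavior of C is a behavior of S.
backward_simulation = C <= S
```

Here `spec_preserved` is true while `backward_simulation` is false, which is exactly the gap between definition 6 and the simulation properties.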

A special case of preservation of a specification, of considerable historical impor- 
tance, is the preservation of type and memory safety, which we can summarize as "if 
S does not go wrong, neither does C" : 

Definition 7 (Preservation of safety) $\mathit{Safe}(S) \implies \mathit{Safe}(C)$.

^1 Section 13.3 formalizes this notion of deterministic execution environment by, in effect,
restricting the set of behaviors B to those generated by a transition function that responds to 
the outputs of the program. 
Combined with a separate check that S is well-typed in a sound type system,
this property implies that C executes without memory violations. Type-preserving 
compilation [72, 71, 21] obtains this guarantee by different means: under the assumption 
that S is well typed, C is proved to be well-typed in a sound type system, ensuring 
that it cannot go wrong. Having proved a semantic preservation property such as 
definitions 3 or 6 provides the same guarantee without having to equip the target and 
intermediate languages with sound type systems and to prove type preservation for the 
compiler. 

In summary, the approach we follow in this work is to prove a "forward simulation 
for safe programs" property (sections 5 to 14), and combine it with a separate proof of 
determinism for the target language (section 13.3), the latter proof being particularly 
easy since the target is a single-threaded assembly language. Combining these two 
proofs, we obtain that all specifications are preserved, in the sense of definition 6, which 
is the result that matters for users of the compiler who practice formal verification at 
the source level. 



2.2 Verified compilers, validated compilers, and certifying compilers 

We now discuss several approaches to establishing that a compiler preserves semantics 
of the compiled programs, in the sense of section 2.1. In the following, we write $S \approx C$,
where S is a source program and C is compiled code, to denote one of the semantic
preservation properties 1 to 7 of section 2.1.

2.2.1 Verified compilers 

We model the compiler as a total function Comp from source programs to either compiled
code (written Comp(S) = OK(C)) or a compile-time error (written Comp(S) = Error).
Compile-time errors correspond to cases where the compiler is unable to produce
code, for instance if the source program is incorrect (syntax error, type error, etc.),
but also if it exceeds the capacities of the compiler (see section 12 for an example). 

Definition 8 (Verified compiler) A compiler Comp is said to be verified if it is 
accompanied with a formal proof of the following property: 

$\forall S, C,\ \mathit{Comp}(S) = \mathtt{OK}(C) \implies S \approx C$    (i)

In other words, a verified compiler either reports an error or produces code that 
satisfies the desired semantic preservation property. Notice that a compiler that always 
fails (Comp(S) = Error for all S) is indeed verified, although useless. Whether the
compiler succeeds in compiling the source programs of interest is not a soundness
issue, but a quality of implementation issue, which is addressed by non-formal methods
such as testing. The important feature, from a formal methods standpoint, is that the 
compiler never silently produces incorrect code. 

Verifying a compiler in the sense of definition 8 amounts to applying program proof 
technology to the compiler sources, using one of the properties defined in section 2 as 
the high-level specification of the compiler. 






2.2.2 Translation validation with verified validators

In the translation validation approach [83,76], the compiler does not need to be verified.
Instead, the compiler is complemented by a validator, a boolean-valued function
Validate(S, C) that verifies the property $S \approx C$ a posteriori. If Comp(S) = OK(C) and
Validate(S, C) = true, the compiled code C is deemed trustworthy. Validation can be
performed in several ways, ranging from symbolic interpretation and static analysis
of S and C [76, 87, 44, 93, 94] to the generation of verification conditions followed by
model checking or automatic theorem proving [83,95,4]. The property $S \approx C$ being
undecidable in general, validators must err on the side of caution and should reply
false if they cannot establish $S \approx C$.^2

Translation validation generates additional confidence in the correctness of the 
compiled code but by itself does not provide formal guarantees as strong as those 
provided by a verified compiler: the validator could itself be unsound. 

Definition 9 (Verified validator) A validator Validate is said to be verified if it is 
accompanied with a formal proof of the following property: 

$\forall S, C,\ \mathit{Validate}(S, C) = \mathtt{true} \implies S \approx C$    (ii)

The combination of a verified validator Validate with an unverified compiler Comp
does provide formal guarantees as strong as those provided by a verified compiler. Such
a combination calls the validator after each run of the compiler, reporting a compile-time
error if validation fails:

    Comp'(S) =
        match Comp(S) with
        | Error -> Error
        | OK(C) -> if Validate(S, C) then OK(C) else Error

If the source and target languages are identical, as is often the case for optimization 
passes, we also have the option to return the source code unchanged if validation fails, 
in effect turning off a potentially incorrect optimization:

    Comp''(S) =
        match Comp(S) with
        | Error -> OK(S)
        | OK(C) -> if Validate(S, C) then OK(C) else OK(S)

Theorem 1 If Validate is a verified validator in the sense of definition 9, Comp' and
Comp" are verified compilers in the sense of definition 8. 
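The two wrappers Comp' and Comp'' can be sketched as higher-order functions. The Python code below is an illustration only. It assumes a toy language in which a program is a tuple of integers whose sole observable behavior is their sum, so that validation is decidable, and the "compiler" is a deliberately buggy optimizer. All names (`checked`, `checked_or_unchanged`, `buggy_optimizer`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OK:
    value: object


@dataclass(frozen=True)
class Error:
    msg: str


def checked(comp, validate):
    """Comp': report an error when the compiler or the validation fails."""
    def comp_prime(source):
        result = comp(source)
        if isinstance(result, Error):
            return result
        if validate(source, result.value):
            return result
        return Error("validation failed")
    return comp_prime


def checked_or_unchanged(comp, validate):
    """Comp'': fall back to the unchanged source (same source and target language)."""
    def comp_second(source):
        result = comp(source)
        if isinstance(result, OK) and validate(source, result.value):
            return result
        return OK(source)
    return comp_second


# Toy language (an assumption of this sketch): a program is a tuple of
# integers and its only observable behavior is their sum.
def buggy_optimizer(prog):
    # Untrusted pass meant to drop zeros; it wrongly drops ones as well.
    return OK(tuple(x for x in prog if x not in (0, 1)))


def validate(source, compiled):
    # Decidable here because the toy semantics is just a sum.
    return sum(source) == sum(compiled)


safe_opt = checked(buggy_optimizer, validate)
lenient_opt = checked_or_unchanged(buggy_optimizer, validate)
```

On the input `(1, 2)` the optimizer produces incorrect code: `safe_opt` turns this into a compile-time error and `lenient_opt` returns the source unchanged, so neither wrapper lets the incorrect output through.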

Verification of a translation validator is therefore an attractive alternative to the 
verification of a compiler, provided the validator is smaller and simpler than the com- 
piler. 

In the presentation above, the validator receives unadorned source and compiled 
codes as arguments. In practice, the validator can also take advantage of additional 
information generated by the compiler and transmitted to the validator as part of C or 
separately. For instance, the validator of [87] exploits debugging information to suggest
a correspondence between program points and between variables of S and C. Credible
compilation [86] carries this approach to the extreme: the compiler is supposed to
annotate C with a full proof of $S \approx C$, so that translation validation reduces to proof
checking.

^2 This conservatism doesn't necessarily render validators incomplete: a validator can be
complete with respect to a particular code transformation or family of transformations.

2.2.3 Proof-carrying code and certifying compilers

The proof-carrying code (PCC) approach [75,2,33] does not attempt to establish se- 
mantic preservation between a source program and some compiled code. Instead, PCC 
focuses on the generation of independently-checkable evidence that the compiled code 
C satisfies a behavioral specification Spec such as type and memory safety. PCC makes 
use of a certifying compiler, which is a function CComp that either fails or returns both 
a compiled code C and a proof $\pi$ of the property $C \models \mathit{Spec}$. The proof $\pi$, also called
a certificate, can be checked independently by the code user; there is no need to trust
the code producer, nor to formally verify the compiler itself.

In a naive view of PCC, the certificate $\pi$ generated by the compiler is a full proof
term and the client-side verifier is a general-purpose proof checker. In practice, it is
sufficient to generate enough hints so that such a full proof can be reconstructed cheaply
on the client side by a specialized checker [78]. If the property of interest is type safety,
PCC can reduce to type-checking of compiled code, as in Java bytecode verification
[90] or typed assembly language [72]: the certificate $\pi$ reduces to type annotations, and
the client-side verifier is a type checker. 

In the original PCC design, the certifying compiler is specialized for a fixed property 
of programs (e.g. type and memory safety), and this property is simple enough to 
be established by the compiler itself. For richer properties, it becomes necessary to 
provide the certifying compiler with a certificate that the source program S satisfies 
the property. It is also possible to make the compiler generic with respect to a family 
of program properties. This extension of PCC is called proof-preserving compilation in 
[89] and certificate translation in [7,8]. 

In all cases, it suffices to formally verify the client-side checker to obtain guarantees 
as strong as those obtained from compiler verification. Symmetrically, a certifying com- 
piler can be constructed (at least theoretically) from a verified compiler. Assume that 
Comp is a verified compiler, using definition 6 as our notion of semantic preservation, 
and further assume that the verification was conducted with a proof assistant that produces
proof terms, such as Coq. Let $\Pi$ be a proof term for the semantic preservation
theorem of Comp, namely

$\Pi : \forall S, C,\ \mathit{Comp}(S) = \mathtt{OK}(C) \implies S \models \mathit{Spec} \implies C \models \mathit{Spec}$

Via the Curry-Howard isomorphism, $\Pi$ is a function that takes S, C, a proof of
Comp(S) = OK(C) and a proof of $S \models \mathit{Spec}$, and returns a proof of $C \models \mathit{Spec}$. A
certifying compiler of the proof-preserving kind can then be defined as follows:

    CComp(S : Source, $\pi_s$ : $S \models \mathit{Spec}$) =
        match Comp(S) with
        | Error -> Error
        | OK(C) -> OK(C, $\Pi\ S\ C\ \pi_{eq}\ \pi_s$)

(Here, $\pi_{eq}$ is a proof term for the proposition Comp(S) = OK(C), which trivially holds
in the context of the match above. Actually building this proof term in Coq requires
additional baggage in the definition above that we omitted for simplicity.) The accompanying
client-side checker is the Coq proof checker. While the certificate produced
by CComp is huge (it contains a proof of soundness for the compilation of all source
programs, not just for S), it could perhaps be specialized for S and C using partial
evaluation techniques.

2.3 Composition of compilation passes 

Compilers are naturally decomposed into several passes that communicate through 
intermediate languages. It is fortunate that verified compilers can also be decomposed 
in this manner. 

Let $\mathit{Comp}_1$ and $\mathit{Comp}_2$ be compilers from languages $L_1$ to $L_2$ and $L_2$ to $L_3$,
respectively. Assume that the semantic preservation property $\approx$ is transitive. (This is
true for all properties considered in section 2.1.) Consider the monadic composition of
$\mathit{Comp}_1$ and $\mathit{Comp}_2$:

    Comp(S) =
        match $\mathit{Comp}_1$(S) with
        | Error -> Error
        | OK(I) -> $\mathit{Comp}_2$(I)

Theorem 2 If the compilers $\mathit{Comp}_1$ and $\mathit{Comp}_2$ are verified, so is their monadic
composition Comp.
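The monadic composition amounts to threading the OK/Error result from one pass to the next. The Python sketch below is an illustration only. The two toy passes (`parse`, `emit`) and the limit of three operands are hypothetical, the latter standing in for a program that "exceeds the capacities of the compiler".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OK:
    value: object


@dataclass(frozen=True)
class Error:
    msg: str


def compose(comp1, comp2):
    """Monadic composition: run comp2 on the result of comp1 unless comp1 failed."""
    def comp(source):
        intermediate = comp1(source)
        if isinstance(intermediate, Error):
            return intermediate
        return comp2(intermediate.value)
    return comp


# Hypothetical first pass: from text to a tuple of integers.
def parse(text):
    tokens = text.split()
    if not all(t.isdigit() for t in tokens):
        return Error("not a number")
    return OK(tuple(int(t) for t in tokens))


# Hypothetical second pass: from integers to stack-machine instructions.
def emit(numbers):
    if len(numbers) > 3:
        return Error("too many operands")
    return OK(tuple(f"push {n}" for n in numbers))


compile_text = compose(parse, emit)
```

An error raised by either pass surfaces unchanged as the result of the composed compiler, which is why the soundness argument for the composition only has to consider the case where both passes return OK.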

2.4 Summary 

The conclusions of this discussion are simple and define the methodology we have 
followed to verify the Compcert compiler back-end. 

1. Provided the target language of the compiler has deterministic semantics, an ap- 
propriate specification for the soundness proof of the compiler is the combination 
of definitions 5 (forward simulation for safe source programs) and 8 (verified com- 
piler), namely 

$\forall S, C, B \notin \mathit{Wrong},\ \mathit{Comp}(S) = \mathtt{OK}(C) \land S \Downarrow B \implies C \Downarrow B$    (i)

2. A verified compiler can be structured as a composition of compilation passes, as is 
commonly done for conventional compilers. Each pass can be proved sound inde- 
pendently. However, all intermediate languages must be given appropriate formal 
semantics. 

3. For each pass, we have a choice between proving the code that implements this 
pass or performing the transformation via untrusted code, then verifying its results 
using a verified validator. The latter approach can reduce the amount of code that 
needs to be proved. In our experience, the verified validator approach is particu- 
larly effective for advanced optimizations, but less so for nonoptimizing translation 
passes and basic dataflow optimizations. Therefore, we did not use this approach
for the compilation passes presented in this article, but elected to prove directly 
the soundness of these passes.^3

4. Finally, provided the proof of (i) is carried out in a prover such as Coq that generates
proof terms and follows the Curry-Howard isomorphism, it is at least theoretically
possible to use the verified compiler in a context of proof-carrying code.

^3 However, a posteriori validation with a verified validator is used for some auxiliary
heuristics such as graph coloring during register allocation (section 8.2) and node
enumeration during CFG linearization (section 10.2).



3 Infrastructure 



This section describes elements of syntax, semantics and proofs that are used through- 
out the Compcert development. 



3.1 Programs 



The syntax of programs in the source, intermediate and target languages shares the
following common shape.

Programs:
    P ::= { vars = $id_1 = data_1^*$; ... $id_n = data_n^*$;      global variables
            functs = $id_1 = Fd_1$; ... $id_n = Fd_n$;            functions
            main = id }                                          entry point

Function definitions:
    Fd ::= internal(F) | external(Fe)

Definitions of internal functions:
    F ::= { sig = sig; body = ...; ... }                         (language-dependent)

Declarations of external functions:
    Fe ::= { tag = id; sig = sig }

Initialization data for global variables:
    data ::= reserve(n) | int8(n) | int16(n)
           | int32(n) | float32(f) | float64(f)

Function signatures:
    sig ::= { args = $\vec{\tau}$; res = ($\tau$ | void) }

Types:
    $\tau$ ::= int                                               integers and pointers
             | float                                             floating-point numbers

A program is composed of a list of global variables with their initialization data, a 
list of functions, and the name of a distinguished function that constitutes the program 
entry point (like main in C). Initialization data is a sequence of integer or floating-point 
constants in various sizes, or reserve(n) to denote n bytes of uninitialized storage.

Two kinds of function definitions Fd are supported. Internal functions F are defined
within the program. The precise contents of an internal function depend on the
language considered, but include at least a signature sig giving the number and types
of parameters and results and a body defining the computation (e.g. as a statement in
Cminor or a list of instructions in PPC). An external function Fe is not defined within
the program, but merely declared with an external name and a signature. External
functions are intended to model input/output operations or other kinds of system
calls. The observable behavior of the program will be defined in terms of a trace of 
invocations of external functions (see section 3.4). 
The types $\tau$ used in function signatures and in other parts of Compcert are extremely
coarse: we only distinguish between integers or pointers on the one hand (type
int) and floating-point numbers on the other hand (type float). In particular, we 
make no attempt to track the type of data pointed to by a pointer. These "types" are 
best thought of as hardware register classes. Their main purpose is to guide register 
allocation and help determine calling conventions from the signature of the function 
being called. 

Each compilation pass is presented as a total function
$\mathit{transf} : F_1 \to (\mathtt{OK}(F_2) \mid \mathtt{Error}(msg))$
where $F_1$ and $F_2$ are the types of internal functions for the source and
target languages (respectively) of the compilation pass. Such transformation functions
are generically extended to function definitions by taking transf(Fe) = OK(Fe), then
to whole programs as a monadic "map" operation over function definitions:

    transf(P) = OK { vars = P.vars; functs = ($\ldots id_i = Fd'_i; \ldots$); main = P.main }

if and only if P.functs = ($\ldots id_i = Fd_i; \ldots$) and $\mathit{transf}(Fd_i) = \mathtt{OK}(Fd'_i)$ for all i.
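The lifting of a per-function transformation to whole programs can be sketched as follows. This Python code is an illustration only. The record types mirror the program shape of section 3.1, while `upper_pass` is a hypothetical pass that treats function bodies as tuples of instruction names and rejects empty bodies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OK:
    value: object


@dataclass(frozen=True)
class Error:
    msg: str


@dataclass(frozen=True)
class Internal:
    fn: object          # language-dependent internal function F


@dataclass(frozen=True)
class External:
    tag: str
    sig: str


@dataclass(frozen=True)
class Program:
    vars: tuple         # ((id, data), ...)
    functs: tuple       # ((id, Fd), ...)
    main: str


def transf_fundef(transf, fd):
    """Extend transf to function definitions: externals are left unchanged."""
    if isinstance(fd, External):
        return OK(fd)
    result = transf(fd.fn)
    return OK(Internal(result.value)) if isinstance(result, OK) else result


def transf_program(transf, p):
    """Monadic map over function definitions: the first Error aborts the pass."""
    functs = []
    for name, fd in p.functs:
        result = transf_fundef(transf, fd)
        if isinstance(result, Error):
            return result
        functs.append((name, result.value))
    return OK(Program(p.vars, tuple(functs), p.main))


# Hypothetical pass: bodies are tuples of instruction names; reject empty bodies.
def upper_pass(body):
    if not body:
        return Error("empty body")
    return OK(tuple(instr.upper() for instr in body))
```

Global variables and the entry point are copied unchanged, and the transformed program exists only if the transformation succeeds on every function definition.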



3.2 Values and memory states 



The dynamic semantics of the Compcert languages manipulate values that are the
discriminated union of 32-bit integers, 64-bit IEEE double precision floats, pointers,
and a special undef value denoting in particular the contents of uninitialized memory.
Pointers are composed of a block identifier b and a signed byte offset $\delta$ within this
block.



Values:           v ::= int(n)                  32-bit machine integer
                      | float(f)                64-bit floating-point number
                      | ptr($b, \delta$)        pointer
                      | undef

Memory blocks:    $b \in \mathbb{Z}$            block identifiers
Block offsets:    $\delta$ ::= n                byte offset within a block (signed)



Values are assigned types in the obvious manner: 

    int(n) : int      float(f) : float      ptr($b, \delta$) : int      undef : $\tau$ for all $\tau$

The memory model used in our semantics is detailed in [59]. Memory states M are
modeled as collections of blocks separated by construction and identified by (mathematical)
integers b. Each block has lower and upper bounds $\mathcal{L}(M, b)$ and $\mathcal{H}(M, b)$,
fixed at allocation time, and associates values to byte offsets $\delta \in [\mathcal{L}(M, b), \mathcal{H}(M, b))$.
The basic operations over memory states are:

— $\mathtt{alloc}(M, l, h) = (b, M')$: allocate a fresh block with bounds $[l, h)$, of size $(h - l)$
bytes; return its identifier b and the updated memory state M'.

— $\mathtt{store}(M, \kappa, b, \delta, v) = \lfloor M' \rfloor$: store value v in the memory quantity $\kappa$ of block b at
offset $\delta$; return the updated memory state M'.

— $\mathtt{load}(M, \kappa, b, \delta) = \lfloor v \rfloor$: read the value v contained in the memory quantity $\kappa$ of
block b at offset $\delta$.

— $\mathtt{free}(M, b) = M'$: free (invalidate) the block b and return the updated memory M'.

The memory quantities $\kappa$ involved in load and store operations represent the kind,
size and signedness of the datum being accessed:
Memory quantities:    $\kappa$ ::= int8signed | int8unsigned | int16signed
                                 | int16unsigned | int32 | float32 | float64



The load and store operations may fail when given an invalid block b or an out-of-bounds
offset $\delta$. Therefore, they return option types, with $\lfloor v \rfloor$ (read: "some v")
denoting success with result v, and $\emptyset$ (read: "none") denoting failure. In this particular
instance of the memory model of [59], alloc and free never fail. In particular, this 
means that we assume an infinite memory. This design decision is discussed further in 
section 17.4. 

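The four operations of the memory model satisfy a number of algebraic properties
stated and proved in [59]. The following "load-after-store" property gives
the general flavor of the memory model. Assume $\mathtt{store}(M_1, \kappa, b, \delta, v) = \lfloor M_2 \rfloor$ and
$\mathtt{load}(M_1, \kappa', b', \delta') = \lfloor v' \rfloor$. Then,

$$
\mathtt{load}(M_2, \kappa', b', \delta') =
\begin{cases}
\lfloor \mathtt{cast}(v, \kappa') \rfloor & \text{if } b' = b \text{ and } \delta' = \delta \text{ and } |\kappa'| = |\kappa|; \\
\lfloor v' \rfloor & \text{if } b' \neq b \text{ or } \delta + |\kappa| \le \delta' \text{ or } \delta' + |\kappa'| \le \delta; \\
\lfloor \mathtt{undef} \rfloor & \text{otherwise.}
\end{cases}
$$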



The $\mathtt{cast}(v, \kappa')$ function performs truncation or sign-extension of value $v$ as prescribed
by the quantity $\kappa'$. Note that $\mathtt{undef}$ is returned (instead of a machine-dependent value)
in cases where the quantities $\kappa$ and $\kappa'$ used for writing and reading disagree, or in cases
where the ranges of bytes written $[\delta, \delta + |\kappa|)$ and read $[\delta', \delta' + |\kappa'|)$ partially overlap. This
way, the memory model hides the endianness and bit-level representations of integers
and floats and makes it impossible to forge pointers from sequences of bytes [59, section 7].
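As an informal illustration of the four operations and of the load-after-store behavior, here is a minimal Python sketch. It is not the Coq development: the class `Memory`, the table `SIZES`, the tuple encoding of values and the in-place updates are all assumptions made for this example. It also simplifies the model by requiring the read quantity to equal the written quantity (rather than merely having the same size) and by omitting `cast`.

```python
from dataclasses import dataclass, field

# Size in bytes of each memory quantity (hypothetical table for this sketch).
SIZES = {"int8signed": 1, "int8unsigned": 1, "int16signed": 2,
         "int16unsigned": 2, "int32": 4, "float32": 4, "float64": 8}
UNDEF = ("undef",)


@dataclass
class Memory:
    """Toy memory: block id -> (low bound, high bound, {offset: (quantity, value)})."""
    blocks: dict = field(default_factory=dict)
    next_block: int = 1

    def alloc(self, lo, hi):
        """Allocate a fresh block with bounds [lo, hi); never fails."""
        b = self.next_block
        self.next_block += 1
        self.blocks[b] = (lo, hi, {})
        return b

    def free(self, b):
        """Invalidate block b; never fails."""
        self.blocks.pop(b, None)

    def _valid(self, kappa, b, ofs):
        if b not in self.blocks:
            return False
        lo, hi, _ = self.blocks[b]
        return lo <= ofs and ofs + SIZES[kappa] <= hi

    def store(self, kappa, b, ofs, v):
        """Return True on success, False on failure (invalid block or offset)."""
        if not self._valid(kappa, b, ofs):
            return False
        cells = self.blocks[b][2]
        size = SIZES[kappa]
        # Earlier writes whose byte ranges overlap the new one become unreadable.
        for o in list(cells):
            if o < ofs + size and ofs < o + SIZES[cells[o][0]]:
                del cells[o]
        cells[ofs] = (kappa, v)
        return True

    def load(self, kappa, b, ofs):
        """Return the stored value, UNDEF on mismatch/overlap, or None on failure."""
        if not self._valid(kappa, b, ofs):
            return None
        hit = self.blocks[b][2].get(ofs)
        if hit is not None and hit[0] == kappa:
            return hit[1]
        return UNDEF
```

In this sketch, a 32-bit store followed by a 16-bit load in the middle of the written range yields `UNDEF`, which mirrors the "partially overlapping ranges" case of the property.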



3.3 Global environments 

The Compcert languages support function pointers but follow a "Harvard" model where
functions and data reside in different memory spaces, and the memory space for functions
is read-only (no self-modifying code). We use positive block identifiers $b$ to refer to
data blocks and negative $b$ to refer to functions via pointers. The operational semantics
for the Compcert languages are parameterized by a global environment $G$ that does not
change during execution. A global environment $G$ maps function blocks $b < 0$ to function
definitions. Moreover, it maps global identifiers (of functions or global variables)
to blocks $b$. The basic operations over global environments are:

– $\mathtt{funct}(G, b) = \lfloor Fd \rfloor$: return the function definition $Fd$ corresponding to the block
  $b < 0$, if any.

– $\mathtt{symbol}(G, id) = \lfloor b \rfloor$: return the block $b$ corresponding to the global variable or
  function name $id$, if any.

– $\mathtt{globalenv}(P) = G$: construct the global environment $G$ associated with the
  program $P$.

– $\mathtt{initmem}(P) = M$: construct the initial memory state $M$ for executing the program
  $P$.

The $\mathtt{globalenv}(P)$ and $\mathtt{initmem}(P)$ functions model (at a high level of abstraction)
the operation of a linker and a program loader. Unique, positive blocks $b$ are allocated
and associated to each global variable $(id = data^*)$ of $P$, and the contents of these
blocks are initialized according to $data^*$. Likewise, unique, negative blocks $b$ are
associated to each function definition $(id = Fd)$ of $P$. In particular, if the functions of $P$
have unique names, the following equivalence holds:

$$(id, Fd) \in P.\mathtt{functs} \iff \exists b < 0.\ \mathtt{symbol}(\mathtt{globalenv}(P), id) = \lfloor b \rfloor \wedge \mathtt{funct}(\mathtt{globalenv}(P), b) = \lfloor Fd \rfloor$$

The allocation of blocks for functions and global variables is deterministic so that 
convenient commutation properties hold between operations on global environments 
and per-function transformations of programs as defined in section 3.1.

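Lemma 1 Assume $\mathtt{transf}(P) = \mathtt{OK}(P')$.

– $\mathtt{initmem}(P') = \mathtt{initmem}(P)$.

– If $\mathtt{symbol}(\mathtt{globalenv}(P), id) = \lfloor b \rfloor$, then $\mathtt{symbol}(\mathtt{globalenv}(P'), id) = \lfloor b \rfloor$.

– If $\mathtt{funct}(\mathtt{globalenv}(P), b) = \lfloor Fd \rfloor$, then there exists a function definition $Fd'$ such
  that $\mathtt{funct}(\mathtt{globalenv}(P'), b) = \lfloor Fd' \rfloor$ and $\mathtt{transf}(Fd) = \mathtt{OK}(Fd')$.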
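The commutation properties of Lemma 1 can be illustrated on a toy model. The sketch below is an assumption-laden simplification: programs are plain dictionaries, `globalenv` and `transf_program` are hypothetical Python functions (not the Coq ones), and the per-function transformation is total, whereas in the text it may fail and returns `OK(...)` on success.

```python
def globalenv(program):
    """Return (symbols, functs) for a program given as
    {"vars": [(name, data), ...], "functs": [(name, fd), ...]}."""
    symbols, functs = {}, {}
    for i, (name, _data) in enumerate(program["vars"], start=1):
        symbols[name] = i              # data blocks are positive: 1, 2, ...
    for i, (name, fd) in enumerate(program["functs"], start=1):
        symbols[name] = -i             # function blocks are negative: -1, -2, ...
        functs[-i] = fd
    return symbols, functs


def transf_program(transf, program):
    """Apply a per-function transformation; names and data are left unchanged."""
    return {"vars": program["vars"],
            "functs": [(name, transf(fd)) for name, fd in program["functs"]]}


# Function "definitions" are just strings here; str.upper stands in for a compiler pass.
P = {"vars": [("counter", [0])],
     "functs": [("main", "body of main"), ("f", "body of f")]}
P2 = transf_program(str.upper, P)
sym, fun = globalenv(P)
sym2, fun2 = globalenv(P2)
```

Because block allocation depends only on the order of definitions, `sym2` equals `sym`, and each function block of `P2` holds the transformed version of the definition found at the same block in `P`.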



3.4 Traces 

We express the observable behaviors of programs in terms of traces of input-output
events, each such event corresponding to an invocation of an external function. An
event records the external name of the external function, the values of the arguments
provided by the program, and the return value provided by the environment (e.g. the
operating system).



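$$
\begin{array}{llll}
\text{Events:} & \nu ::= & id(\vec{v_\nu} \mapsto v_\nu) & \\
\text{Event values:} & v_\nu ::= & \mathtt{int}(n) \mid \mathtt{float}(f) & \\
\text{Traces:} & t ::= & \varepsilon \mid \nu.t & \text{finite traces (inductive)} \\
 & T ::= & \varepsilon \mid \nu.T & \text{finite or infinite traces (coinductive)} \\
\text{Behaviors:} & B ::= & \mathtt{converges}(t, n) & \text{termination with trace $t$ and exit code $n$} \\
 & \mid & \mathtt{diverges}(T) & \text{divergence with trace $T$} \\
 & \mid & \mathtt{goeswrong}(t) & \text{going wrong with trace $t$}
\end{array}
$$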



We consider two types of traces: finite traces t for terminating or "going wrong" 
executions and finite or infinite traces T for diverging executions. Note that a diverging 
program can generate an empty or finite trace of input-output events (think infinite 
empty loop). 

Concatenation of a finite trace $t$ and a finite trace $t'$ or infinite trace $T$ is written
$t.t'$ or $t.T$. It is associative and admits the empty trace $\varepsilon$ as neutral element.

The values that are arguments and results of input-output events are required to
be integers or floats. Since external functions cannot modify the memory state, passing
them pointer values would be useless. Even with this restriction, events and traces
can still model character-based input-output. We encapsulate these restrictions in the
following inference rule that defines the effect of applying an external function $Fe$ to
arguments $\vec{v}$.

$$
\frac{\vec{v} \text{ and } v \text{ are integers or floats} \qquad \vec{v} \text{ and } v \text{ agree in number and types with } Fe.\mathtt{sig} \qquad t = Fe.\mathtt{tag}(\vec{v} \mapsto v)}
     {\vdash Fe(\vec{v}) \stackrel{t}{\Rightarrow} v}
$$

Note that the result value v and therefore the trace t are not completely determined 
by this rule. We return to this point in section 13.3. 



3.5 Transition semantics 

The operational semantics for the source, target and intermediate languages of the
Compcert back-end are defined as labeled transition systems. The transition relation
for each language is written $G \vdash S \xrightarrow{t} S'$ and denotes one execution step from state
$S$ to state $S'$ in global environment $G$. The trace $t$ denotes the observable events
generated by this execution step. Transitions corresponding to an invocation of an
external function record the associated event in $t$. Other transitions have $t = \varepsilon$. In
addition to the type of states $S$ and the transition relation $G \vdash S \xrightarrow{t} S'$, each language
defines two predicates:

– $\mathtt{initial}(P, S)$: the state $S$ is an initial state for the program $P$. Typically, $S$
  corresponds to an invocation of the main function of $P$ in the initial memory state
  $\mathtt{initmem}(P)$.

– $\mathtt{final}(S, n)$: the state $S$ is a final state with exit code $n$. Typically, this means
  that the program is returning from the initial invocation of its main function, with
  return value $\mathtt{int}(n)$.

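Executions are modeled classically as sequences of transitions from an initial state to
a final state. We write $G \vdash S \xrightarrow{t}{}^{+} S'$ to denote one or several transitions (transitive
closure), $G \vdash S \xrightarrow{t}{}^{*} S'$ to denote zero, one or several transitions (reflexive transitive
closure), and $G \vdash S \xrightarrow{T} \infty$ to denote an infinite sequence of transitions starting with $S$.
The traces $t$ (finite) and $T$ (finite or infinite) are formed by concatenating the traces
of elementary transitions. Formally:

$$
G \vdash S \xrightarrow{\varepsilon}{}^{*} S
\qquad
\frac{G \vdash S \xrightarrow{t_1} S' \qquad G \vdash S' \xrightarrow{t_2}{}^{*} S''}
     {G \vdash S \xrightarrow{t_1.t_2}{}^{*} S''}
$$

$$
\frac{G \vdash S \xrightarrow{t_1} S' \qquad G \vdash S' \xrightarrow{t_2}{}^{*} S''}
     {G \vdash S \xrightarrow{t_1.t_2}{}^{+} S''}
\qquad
\frac{G \vdash S \xrightarrow{t} S' \qquad G \vdash S' \xrightarrow{T} \infty}
     {\overline{\;G \vdash S \xrightarrow{t.T} \infty\;}}
$$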

As denoted by the double horizontal bar, the inference rule defining $G \vdash S \xrightarrow{T} \infty$ is
to be interpreted coinductively, as a greatest fixpoint. The observable behavior of a
program P is defined as follows. Starting from an initial state, if a finite sequence 
of reductions with trace t leads to a final state with exit code n, the program has 
observable behavior converges (t, n). If an infinite sequence of reductions with trace 
T is possible, the observable behavior of the program is diverges (T). Finally, if the 
program gets stuck on a non-final state after performing a sequence of reductions with 
trace t, the behavior is goeswrong(t). 



[Figure 3 (diagram not reproduced): an automaton over the three kinds of program states, with edges labeled by the kind of step taken, e.g. external function, internal function, call instruction, other instructions, non-empty call stack.]

Fig. 3 Transitions between the three kinds of program states.

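$$
\frac{\mathtt{initial}(P, S) \qquad \mathtt{globalenv}(P) \vdash S \xrightarrow{t}{}^{*} S' \qquad \mathtt{final}(S', n)}
     {P \Downarrow \mathtt{converges}(t, n)}
$$

$$
\frac{\mathtt{initial}(P, S) \qquad \mathtt{globalenv}(P) \vdash S \xrightarrow{T} \infty}
     {P \Downarrow \mathtt{diverges}(T)}
$$

$$
\frac{\mathtt{initial}(P, S) \qquad \mathtt{globalenv}(P) \vdash S \xrightarrow{t}{}^{*} S' \qquad S' \not\rightarrow \qquad \forall n,\ \neg\mathtt{final}(S', n)}
     {P \Downarrow \mathtt{goeswrong}(t)}
$$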

The set of "going wrong" behaviors is defined in the obvious manner: Wrong = 
{goeswrong(t) | t a finite trace}. 
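To make the three kinds of behaviors concrete, the following Python sketch runs a deterministic transition system and classifies what it observes. Everything here is hypothetical scaffolding for illustration: `run`, its `fuel` bound, and the toy `countdown_*` functions are not part of the formal development. Since divergence cannot be detected by finite execution, the sketch reports `"unknown"` when the step budget runs out instead of claiming `diverges(T)`.

```python
def run(step, final, state, fuel=1000):
    """Execute at most `fuel` transitions from `state`.

    step(state)  -> (events, next_state), or None if no transition is possible
    final(state) -> exit code n if the state is final, else None
    """
    trace = []
    for _ in range(fuel):
        code = final(state)
        if code is not None:
            return ("converges", tuple(trace), code)
        nxt = step(state)
        if nxt is None:
            # Stuck on a non-final state.
            return ("goeswrong", tuple(trace))
        events, state = nxt
        trace.extend(events)        # concatenate the traces of elementary steps
    return ("unknown", tuple(trace))


def countdown_step(n):
    """Toy system: from n > 0, emit one event and move to n - 1."""
    if n > 0:
        return (["print(%d)" % n], n - 1)
    return None


def countdown_final(n):
    """State 0 is final with exit code 0; no other state is final."""
    return 0 if n == 0 else None
```

Starting the toy system from a negative state gets stuck immediately on a non-final state, which corresponds to `goeswrong` with the empty trace.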

3.6 Program states 

The contents of a program state vary from language to language. For the assembly 
language PPC, a state is just a pair of a memory state and a mapping from processor 
registers to values (section 13.2). For the other languages of the Compcert back-end, 
states come in three kinds written $\mathcal{S}$, $\mathcal{C}$ and $\mathcal{R}$.

— Regular states S correspond to an execution point within an internal function. They 
carry the function in question and a program point within this function, possibly 
along with additional language-specific components such as environments giving 
values to function-local variables. 

— Call states C materialize parameter passing from the caller to the callee. They carry 
the function definition Fd being invoked and either a list of argument values or an 
environment where the argument values can be found at conventional locations. 

— Return states $\mathcal{R}$ correspond to returning from a function to its caller. They carry
at least the return value or an environment where this value can be found. 

All three kinds of states also carry the current memory state as well as a call stack: a 
list of frames describing the functions in the call chain, with the corresponding program 
points where execution should be resumed on return, possibly along with function-local 
environments. 

If we project the transition relation on the three-element set $\{\mathcal{S}, \mathcal{C}, \mathcal{R}\}$, abstracting
away the components carried by the states, we obtain the finite automaton depicted in
figure 3. This automaton is shared by all languages of the Compcert back-end except
PPC, and it illustrates the interplay between the three kinds of states. Initial states
are call states with empty call stacks. A call state where the called function is external
transitions directly to a return state after generating the appropriate event in the 
trace. A call state where the called function is internal transitions to a regular state 
corresponding to the function entry point, possibly after binding the argument values 
to the parameter variables. Non-call, non-return instructions go from regular states to 
regular states. A non-tail call instruction resolves the called function, pushes a return 
frame on the call stack and transitions to the corresponding call state. A tail call is 
similar but does not push a return frame. A return instruction transitions to a return 
state. A return state with a non-empty call stack pops the top return frame and moves 
to the corresponding regular state. A return state with an empty call stack is a final 
state. 
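The automaton described in this paragraph can be written down as a small transition table. The Python encoding below is only a sketch: the letters `"S"`, `"C"`, `"R"` stand for regular, call and return states, and the edge labels are informal names chosen for this example.

```python
# (current kind, kind of step) -> next kind
TRANSITIONS = {
    ("C", "internal function"): "S",      # enter the body of an internal function
    ("C", "external function"): "R",      # external call: emit an event, then return
    ("S", "other instruction"): "S",      # non-call, non-return instruction
    ("S", "call instruction"): "C",       # pushes a return frame
    ("S", "tail call instruction"): "C",  # does not push a return frame
    ("S", "return instruction"): "R",
    ("R", "non-empty call stack"): "S",   # pop the top frame, resume the caller
}


def state_kinds(labels, start="C"):
    """Follow a sequence of step labels from an initial (call) state."""
    kinds = [start]
    for label in labels:
        kinds.append(TRANSITIONS[(kinds[-1], label)])
    return kinds
```

A return state with an empty call stack has no outgoing edge in this table: it is a final state.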

3.7 Generic simulation diagrams 

Consider two languages $L_1$ and $L_2$ defined by their transition semantics as described
in section 3.5. Let $P_1$ be a program in $L_1$ and $P_2$ a program in $L_2$ obtained by applying
a transformation to $P_1$. We wish to show that $P_2$ preserves the semantics of $P_1$, that
is, $P_1 \Downarrow B \Longrightarrow P_2 \Downarrow B$ for all behaviors $B \notin \mathtt{Wrong}$. The approach we use throughout
this work is to construct a relation $S_1 \sim S_2$ between states of $L_1$ and states of $L_2$
and show that it is a forward simulation. First, initial states and final states should be
related by $\sim$ in the following sense:

– Initial states: if $\mathtt{initial}(P_1, S_1)$ and $\mathtt{initial}(P_2, S_2)$, then $S_1 \sim S_2$.

– Final states: if $S_1 \sim S_2$ and $\mathtt{final}(S_1, n)$, then $\mathtt{final}(S_2, n)$.

Second, assuming $S_1 \sim S_2$, we need to relate transitions starting from $S_1$ in $L_1$ with
transitions starting from $S_2$ in $L_2$. The simplest property that guarantees semantic
preservation is the following lock-step simulation property:

Definition 10 Lock-step simulation: if $S_1 \sim S_2$ and $G_1 \vdash S_1 \xrightarrow{t} S_1'$, there exists $S_2'$
such that $G_2 \vdash S_2 \xrightarrow{t} S_2'$ and $S_1' \sim S_2'$.

($G_1$ and $G_2$ are the global environments corresponding to $P_1$ and $P_2$, respectively.)
Figure 4, top left, shows the corresponding diagram.

Theorem 3 Under hypotheses "initial states", "final states" and "lock-step simulation",
$P_1 \Downarrow B$ and $B \notin \mathtt{Wrong}$ imply $P_2 \Downarrow B$.

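Proof A trivial induction shows that $S_1 \sim S_2$ and $G_1 \vdash S_1 \xrightarrow{t}{}^{*} S_1'$ implies the existence
of $S_2'$ such that $G_2 \vdash S_2 \xrightarrow{t}{}^{*} S_2'$ and $S_1' \sim S_2'$. Likewise, a trivial coinduction shows
that $S_1 \sim S_2$ and $G_1 \vdash S_1 \xrightarrow{T} \infty$ implies $G_2 \vdash S_2 \xrightarrow{T} \infty$. The result follows from the
definition of $\Downarrow$.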
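On finite transition systems, the lock-step simulation property can be checked by brute force, which may help in reading the definition. The sketch below is illustrative only: transition systems are dictionaries from a state to its list of `(trace, next_state)` steps, traces are plain strings with `""` for the empty trace, and `is_lockstep_simulation` is a hypothetical helper, not something taken from the Coq development.

```python
def is_lockstep_simulation(rel, steps1, steps2):
    """Check that every step of the first system from a related pair
    is matched by a step of the second system with the same trace,
    leading to a related pair again."""
    for s1, s2 in rel:
        for t, s1_next in steps1.get(s1, []):
            matched = any(t2 == t and (s1_next, s2_next) in rel
                          for t2, s2_next in steps2.get(s2, []))
            if not matched:
                return False
    return True


# Two three-state systems that emit the same single event.
steps1 = {"a0": [("", "a1")], "a1": [("out(1)", "a2")]}
steps2 = {"b0": [("", "b1")], "b1": [("out(1)", "b2")]}
rel = {("a0", "b0"), ("a1", "b1"), ("a2", "b2")}

# A variant of the second system that emits a different event.
steps2_bad = {"b0": [("", "b1")], "b1": [("out(2)", "b2")]}
```

With these definitions the relation `rel` is a lock-step simulation between `steps1` and `steps2`, but not between `steps1` and `steps2_bad`, since the event `out(1)` cannot be matched.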

The lock-step simulation hypothesis is too strong for many program transformations of
interest, however. Some transformations cause transitions in $P_1$ to disappear in $P_2$, e.g.
removal of no-operations, elimination of redundant computations, or branch tunneling.
Likewise, some transformations introduce additional transitions in $P_2$, e.g. insertion of
spilling and reloading code. Naively, we could try to relax the simulation hypothesis as
follows:

Definition 11 Naive "star" simulation: if $S_1 \sim S_2$ and $G_1 \vdash S_1 \xrightarrow{t} S_1'$, there exists
$S_2'$ such that $G_2 \vdash S_2 \xrightarrow{t}{}^{*} S_2'$ and $S_1' \sim S_2'$.

Fig. 4 Four kinds of simulation diagrams that imply semantic preservation. Solid lines denote
hypotheses; dashed lines denote conclusions.



This hypothesis suffices to show the preservation of terminating behaviors, but does
not guarantee that diverging behaviors are preserved because of the classic "infinite
stuttering" problem. The original program $P_1$ could perform infinitely many silent
transitions $S_1 \xrightarrow{\varepsilon} S_2 \xrightarrow{\varepsilon} \cdots \xrightarrow{\varepsilon} S_n \xrightarrow{\varepsilon} \cdots$ while the transformed program $P_2$ is stuck
in a state $S'$ such that $S_i \sim S'$ for all $i$. In this case, $P_1$ diverges while $P_2$ does not,
and semantic preservation does not hold. To rule out the infinite stuttering problem,
assume we are given a measure $|S_1|$ over the states of language $L_1$. This measure ranges
over a type $\mathcal{M}$ equipped with a well-founded ordering $<$ (that is, there are no infinite
decreasing chains of elements of $\mathcal{M}$). We require that the measure strictly decreases
in cases where stuttering could occur, making it impossible for stuttering to occur
infinitely.

Definition 12 "Star" simulation: if $S_1 \sim S_2$ and $G_1 \vdash S_1 \xrightarrow{t} S_1'$, either

1. there exists $S_2'$ such that $G_2 \vdash S_2 \xrightarrow{t}{}^{+} S_2'$ and $S_1' \sim S_2'$,

2. or $|S_1'| < |S_1|$ and there exists $S_2'$ such that $G_2 \vdash S_2 \xrightarrow{t}{}^{*} S_2'$ and $S_1' \sim S_2'$.

Diagrammatically, this hypothesis corresponds to the bottom left part of figure 4.
(Equivalently, part 2 of the definition could be replaced by "or $|S_1'| < |S_1|$ and $t = \varepsilon$
and $S_1' \sim S_2$", but the formulation above is more convenient in practice.)

Theorem 4 Under hypotheses "initial states", "final states" and "star simulation",
$P_1 \Downarrow B$ and $B \notin \mathtt{Wrong}$ imply $P_2 \Downarrow B$.

Proof A trivial induction shows that $S_1 \sim S_2$ and $G_1 \vdash S_1 \xrightarrow{t}{}^{*} S_1'$ implies the existence
of $S_2'$ such that $G_2 \vdash S_2 \xrightarrow{t}{}^{*} S_2'$ and $S_1' \sim S_2'$. This implies the desired result if $B$ is
a terminating behavior. For diverging behaviors, we first define (coinductively) the
following "measured" variant of the $G_2 \vdash S_2 \xrightarrow{T} \infty$ relation:

$$
\frac{G_2 \vdash S_2 \xrightarrow{t}{}^{+} S_2' \qquad G_2 \vdash S_2', \mu' \xrightarrow{T} \infty}
     {\overline{\;G_2 \vdash S_2, \mu \xrightarrow{t.T} \infty\;}}
\qquad
\frac{G_2 \vdash S_2 \xrightarrow{t}{}^{*} S_2' \qquad \mu' < \mu \qquad G_2 \vdash S_2', \mu' \xrightarrow{T} \infty}
     {\overline{\;G_2 \vdash S_2, \mu \xrightarrow{t.T} \infty\;}}
$$

The second rule permits a number of potentially stuttering steps to be taken, provided
the measure $\mu$ strictly decreases. After a finite number of invocations of this rule, it
becomes non applicable and the first rule must be applied, forcing at least one transition
to be taken and resetting the measure to an arbitrarily-chosen value. A straightforward
coinduction shows that $G_1 \vdash S_1 \xrightarrow{T} \infty$ and $S_1 \sim S_2$ implies $G_2 \vdash S_2, |S_1| \xrightarrow{T} \infty$. To
conclude, it suffices to prove that $G_2 \vdash S_2, \mu \xrightarrow{T} \infty$ implies $G_2 \vdash S_2 \xrightarrow{T} \infty$. This follows
by coinduction and the following inversion lemma, proved by Noetherian induction
over $\mu$: if $G_2 \vdash S_2, \mu \xrightarrow{T} \infty$, there exists $S_2'$, $\mu'$, $t$ and $T'$ such that $G_2 \vdash S_2 \xrightarrow{t} S_2'$ and
$G_2 \vdash S_2', \mu' \xrightarrow{T'} \infty$ and $T = t.T'$.

Here are two stronger variants of the "star" simulation hypothesis that are convenient 
in practice. (See figure 4 for the corresponding diagrams.) 

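Definition 13 "Plus" simulation: if $S_1 \sim S_2$ and $G_1 \vdash S_1 \xrightarrow{t} S_1'$, there exists $S_2'$ such
that $G_2 \vdash S_2 \xrightarrow{t}{}^{+} S_2'$ and $S_1' \sim S_2'$.

Definition 14 "Option" simulation: if $S_1 \sim S_2$ and $G_1 \vdash S_1 \xrightarrow{t} S_1'$, either

1. there exists $S_2'$ such that $G_2 \vdash S_2 \xrightarrow{t} S_2'$ and $S_1' \sim S_2'$,

2. or $|S_1'| < |S_1|$ and $t = \varepsilon$ and $S_1' \sim S_2$.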

Either simulation hypothesis implies the "star" simulation property and therefore se- 
mantic preservation per theorem 4. 

4 The source language: Cminor 

The input language of our back-end is called Cminor. It is a simple, low-level imperative 
language, comparable to a stripped-down, typeless variant of C. Another source of 
inspiration was the C-- intermediate language of Peyton Jones et al. [81]. In the
CompCert compilation chain, Cminor is the lowest-level language that is still processor 
independent; it is therefore an appropriate language to start the back-end part of the 
compiler. 

4.1 Syntax 

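Cminor is, classically, structured in expressions, statements, functions and whole
programs.

$$
\begin{array}{lll}
\text{Expressions:} & & \\
a ::= & id & \text{reading a local variable} \\
\mid & cst & \text{constant} \\
\mid & op_1(a_1) & \text{unary arithmetic operation} \\
\mid & op_2(a_1, a_2) & \text{binary arithmetic operation} \\
\mid & \kappa[a_1] & \text{memory read at address $a_1$} \\
\mid & a_1 \mathbin{?} a_2 : a_3 & \text{conditional expression} \\
\text{Constants:} & & \\
cst ::= & n \mid f & \text{integer or float literal} \\
\mid & \mathtt{addrsymbol}(id) & \text{address of a global symbol} \\
\mid & \mathtt{addrstack}(\delta) & \text{address within stack data} \\
\text{Unary operators:} & & \\
op_1 ::= & \mathtt{negint} \mid \mathtt{notint} \mid \mathtt{notbool} & \text{integer arithmetic} \\
\mid & \mathtt{negf} \mid \mathtt{absf} & \text{float arithmetic} \\
\mid & \mathtt{cast8u} \mid \mathtt{cast8s} \mid \mathtt{cast16u} \mid \mathtt{cast16s} & \text{zero and sign extensions} \\
\mid & \mathtt{singleoffloat} & \text{float truncation} \\
\mid & \mathtt{intoffloat} \mid \mathtt{intuoffloat} & \text{float-to-int conversions} \\
\mid & \mathtt{floatofint} \mid \mathtt{floatofintu} & \text{int-to-float conversions} \\
\text{Binary operators:} & & \\
op_2 ::= & \mathtt{add} \mid \mathtt{sub} \mid \mathtt{mul} \mid \mathtt{div} \mid \mathtt{divu} \mid \mathtt{mod} \mid \mathtt{modu} & \text{integer arithmetic} \\
\mid & \mathtt{and} \mid \mathtt{or} \mid \mathtt{xor} \mid \mathtt{shl} \mid \mathtt{shr} \mid \mathtt{shru} & \text{integer bit operation} \\
\mid & \mathtt{addf} \mid \mathtt{subf} \mid \mathtt{mulf} \mid \mathtt{divf} & \text{float arithmetic} \\
\mid & \mathtt{cmp}(c) \mid \mathtt{cmpu}(c) \mid \mathtt{cmpf}(c) & \text{comparisons} \\
\text{Comparisons:} & & \\
c ::= & \mathtt{eq} \mid \mathtt{ne} \mid \mathtt{gt} \mid \mathtt{ge} \mid \mathtt{lt} \mid \mathtt{le} &
\end{array}
$$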



Expressions are pure: side-effecting operations such as assignment and function
calls are statements, not expressions. All arithmetic and logical operators of C are
supported. Unlike in C, there is no overloading nor implicit conversions between types:
distinct arithmetic operators are provided over integers and over floats; likewise, explicit
conversion operators are provided to convert between floats and integers and perform
zero and sign extensions. Memory loads are explicit and annotated with the memory
quantity $\kappa$ being accessed.



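$$
\begin{array}{lll}
\text{Statements:} & & \\
s ::= & \mathtt{skip} & \text{no operation} \\
\mid & id = a & \text{assignment to a local variable} \\
\mid & \kappa[a_1] = a_2 & \text{memory write at address $a_1$} \\
\mid & id^? = a(\vec{a}) : sig & \text{function call} \\
\mid & \mathtt{tailcall}\ a(\vec{a}) : sig & \text{function tail call} \\
\mid & \mathtt{return}(a^?) & \text{function return} \\
\mid & s_1; s_2 & \text{sequence} \\
\mid & \mathtt{if}(a)\ \{s_1\}\ \mathtt{else}\ \{s_2\} & \text{conditional} \\
\mid & \mathtt{loop}\ \{s_1\} & \text{infinite loop} \\
\mid & \mathtt{block}\ \{s_1\} & \text{block delimiting exit constructs} \\
\mid & \mathtt{exit}(n) & \text{terminate the $(n+1)^{\text{th}}$ enclosing block} \\
\mid & \mathtt{switch}(a)\ \{tbl\} & \text{multi-way test and exit} \\
\mid & lbl : s & \text{labeled statement} \\
\mid & \mathtt{goto}\ lbl & \text{jump to a label} \\
\text{Switch tables:} & & \\
tbl ::= & \mathtt{default} : \mathtt{exit}(n) & \\
\mid & \mathtt{case}\ i : \mathtt{exit}(n);\ tbl &
\end{array}
$$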



Base statements are skip, assignment $id = a$ to a local variable, memory store
$\kappa[a_1] = a_2$ (of the value of $a_2$ in the quantity $\kappa$ at address $a_1$), function call (with
optional assignment of the return value to a local variable), function tail call, and function
return (with an optional result). Function calls are annotated with the signatures $sig$
expected for the called function. A tail call $\mathtt{tailcall}\ a(\vec{a})$ is almost equivalent to a
regular call immediately followed by a return, except that the tail call deallocates the
current stack data block before invoking the function. This enables tail recursion to
execute in constant stack space.

"average"(arr, sz) : int, int -> float
{
    vars s, i; stacksize 0;
    s = 0.0; i = 0;
    block { loop {
        if (i >= sz) exit(0);
        s = s +f floatofint(int32[arr + i*4]);
        i = i + 1;
    } }
    return s /f floatofint(sz);
}

Fig. 5 An example of a Cminor function and the corresponding C code.

Besides the familiar sequence $s_1; s_2$ and if/then/else constructs, control flow can
be expressed either in an unstructured way using goto and labeled statements or in a
structured way using infinite loops and the block/exit construct. $\mathtt{exit}(n)$ where $n \ge 0$
branches to the end of the $(n+1)^{\text{th}}$ enclosing block construct. The $\mathtt{switch}(a)\ \{tbl\}$
construct matches the integer value of $a$ against the cases listed in $tbl$ and performs the
corresponding exit. Appropriate nesting of a switch within several block constructs
suffices to express C-like structured switch statements with fall-through behavior.

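$$
\begin{array}{lll}
\text{Internal functions:} & F ::= \{\ \mathtt{sig} = sig; & \text{function signature} \\
 & \phantom{F ::= \{\ } \mathtt{params} = \vec{id}; & \text{parameters} \\
 & \phantom{F ::= \{\ } \mathtt{vars} = \vec{id}; & \text{local variables} \\
 & \phantom{F ::= \{\ } \mathtt{stacksize} = n; & \text{size of stack data in bytes} \\
 & \phantom{F ::= \{\ } \mathtt{body} = s\ \} & \text{function body}
\end{array}
$$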

In addition to a parameter list, local variable declarations and a function body
(a statement), a function definition comprises a type signature $sig$ and a declaration
of how many bytes of stack-allocated data it needs. A Cminor local variable does not
reside in memory, and its address cannot be taken. However, the Cminor producer can
explicitly stack-allocate some data (such as, in C, arrays and scalar variables whose
addresses are taken). A fresh memory block of size $F.\mathtt{stacksize}$ is allocated each time $F$
is invoked and automatically freed when it returns. The $\mathtt{addrstack}(\delta)$ nullary operator
returns a pointer within this block at byte offset $\delta$.

Figure 5 shows a simple C function and the corresponding Cminor function, using
an ad-hoc concrete syntax for Cminor. Both functions compute the average value of
an array of integers, using float arithmetic. Note the explicit address computation
int32[arr + i*4] to access element i of the array, as well as the explicit floatofint
conversions. The for loop is expressed as an infinite loop wrapped in a block, so that
exit(0) in the Cminor code behaves like the break statement in C.

4.2 Dynamic semantics 

The dynamic semantics of Cminor is defined using a combination of natural seman- 
tics for the evaluation of expressions and a labeled transition system in the style of 
section 3.5 for the execution of statements and functions. 



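C code for Fig. 5:

double average(int arr[], int sz)
{
    double s; int i;
    for (i = 0, s = 0; i < sz; i++)
        s += arr[i];
    return s / sz;
}

$$
\frac{E(id) = \lfloor v \rfloor}{G, \sigma, E, M \vdash id \Rightarrow v}
\qquad
\frac{\mathtt{eval\_constant}(G, \sigma, cst) = \lfloor v \rfloor}{G, \sigma, E, M \vdash cst \Rightarrow v}
$$

$$
\frac{G, \sigma, E, M \vdash a_1 \Rightarrow v_1 \qquad \mathtt{eval\_unop}(op_1, v_1) = \lfloor v \rfloor}
     {G, \sigma, E, M \vdash op_1(a_1) \Rightarrow v}
$$

$$
\frac{G, \sigma, E, M \vdash a_1 \Rightarrow v_1 \qquad G, \sigma, E, M \vdash a_2 \Rightarrow v_2 \qquad \mathtt{eval\_binop}(op_2, v_1, v_2) = \lfloor v \rfloor}
     {G, \sigma, E, M \vdash op_2(a_1, a_2) \Rightarrow v}
$$

$$
\frac{G, \sigma, E, M \vdash a_1 \Rightarrow \mathtt{ptr}(b, \delta) \qquad \mathtt{load}(M, \kappa, b, \delta) = \lfloor v \rfloor}
     {G, \sigma, E, M \vdash \kappa[a_1] \Rightarrow v}
$$

$$
\frac{G, \sigma, E, M \vdash a_1 \Rightarrow v_1 \qquad \mathtt{istrue}(v_1) \qquad G, \sigma, E, M \vdash a_2 \Rightarrow v_2}
     {G, \sigma, E, M \vdash (a_1 \mathbin{?} a_2 : a_3) \Rightarrow v_2}
$$

$$
\frac{G, \sigma, E, M \vdash a_1 \Rightarrow v_1 \qquad \mathtt{isfalse}(v_1) \qquad G, \sigma, E, M \vdash a_3 \Rightarrow v_3}
     {G, \sigma, E, M \vdash (a_1 \mathbin{?} a_2 : a_3) \Rightarrow v_3}
$$

$$
G, \sigma, E, M \vdash \varepsilon \Rightarrow \varepsilon
\qquad
\frac{G, \sigma, E, M \vdash a \Rightarrow v \qquad G, \sigma, E, M \vdash \vec{a} \Rightarrow \vec{v}}
     {G, \sigma, E, M \vdash a.\vec{a} \Rightarrow v.\vec{v}}
$$

Evaluation of constants:

$$
\begin{array}{lll}
\mathtt{eval\_constant}(G, \sigma, i) & = & \lfloor \mathtt{int}(i) \rfloor \\
\mathtt{eval\_constant}(G, \sigma, f) & = & \lfloor \mathtt{float}(f) \rfloor \\
\mathtt{eval\_constant}(G, \sigma, \mathtt{addrsymbol}(id)) & = & \mathtt{symbol}(G, id) \\
\mathtt{eval\_constant}(G, \sigma, \mathtt{addrstack}(\delta)) & = & \lfloor \mathtt{ptr}(\sigma, \delta) \rfloor
\end{array}
$$

Evaluation of unary operators (selected cases):

$$
\begin{array}{llll}
\mathtt{eval\_unop}(\mathtt{negf}, \mathtt{float}(f)) & = & \lfloor \mathtt{float}(-f) \rfloor & \\
\mathtt{eval\_unop}(\mathtt{notbool}, v) & = & \lfloor \mathtt{int}(0) \rfloor & \text{if } \mathtt{istrue}(v) \\
\mathtt{eval\_unop}(\mathtt{notbool}, v) & = & \lfloor \mathtt{int}(1) \rfloor & \text{if } \mathtt{isfalse}(v)
\end{array}
$$

Evaluation of binary operators (selected cases):

$$
\begin{array}{llll}
\mathtt{eval\_binop}(\mathtt{add}, \mathtt{int}(n_1), \mathtt{int}(n_2)) & = & \lfloor \mathtt{int}(n_1 + n_2) \rfloor & (\bmod\ 2^{32}) \\
\mathtt{eval\_binop}(\mathtt{add}, \mathtt{ptr}(b, \delta), \mathtt{int}(n)) & = & \lfloor \mathtt{ptr}(b, \delta + n) \rfloor & (\bmod\ 2^{32}) \\
\mathtt{eval\_binop}(\mathtt{addf}, \mathtt{float}(f_1), \mathtt{float}(f_2)) & = & \lfloor \mathtt{float}(f_1 + f_2) \rfloor &
\end{array}
$$

Truth values:

$$
\begin{array}{lll}
\mathtt{istrue}(v) & \stackrel{\text{def}}{=} & v \text{ is } \mathtt{ptr}(b, \delta) \text{ or } \mathtt{int}(n) \text{ with } n \neq 0 \\
\mathtt{isfalse}(v) & \stackrel{\text{def}}{=} & v \text{ is } \mathtt{int}(0)
\end{array}
$$

Fig. 6 Natural semantics for Cminor expressions.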



4.2.1 Evaluation of expressions

Figure 6 defines the big-step evaluation of expressions as the judgment $G, \sigma, E, M \vdash
a \Rightarrow v$, where $a$ is the expression to evaluate, $v$ its value, $\sigma$ the stack data block, $E$
an environment mapping local variables to values, and $M$ the current memory state.
The evaluation rules are straightforward. Most of the semantics is in the definition of
the auxiliary functions eval_constant, eval_unop and eval_binop, for which some
representative cases are shown. Those functions can return $\emptyset$, causing the expression
to be undefined, if for instance an argument of an operator is undef or of the wrong
type. Some operators (add, sub and cmp) operate both on integers and on pointers.
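The selected cases of Figure 6 translate almost directly into executable form. The following Python sketch is an illustration under stated assumptions, not the Coq definitions: values are encoded as tuples such as `("int", n)` or `("ptr", b, ofs)`, integers are kept as unsigned residues modulo 2^32, `None` plays the role of the "none" result, memory loads are left out, and `eval_expr` with its tuple-based expression syntax is invented for this example.

```python
MASK = 0xFFFFFFFF  # arithmetic modulo 2**32


def istrue(v):
    return v[0] == "ptr" or (v[0] == "int" and v[1] != 0)


def isfalse(v):
    return v == ("int", 0)


def eval_unop(op, v):
    if op == "negf" and v[0] == "float":
        return ("float", -v[1])
    if op == "notbool":
        if istrue(v):
            return ("int", 0)
        if isfalse(v):
            return ("int", 1)
    return None  # undef, wrong type, or a case not covered by this sketch


def eval_binop(op, v1, v2):
    if op == "add":
        if v1[0] == "int" and v2[0] == "int":
            return ("int", (v1[1] + v2[1]) & MASK)
        if v1[0] == "ptr" and v2[0] == "int":
            return ("ptr", v1[1], (v1[2] + v2[1]) & MASK)
    if op == "addf" and v1[0] == "float" and v2[0] == "float":
        return ("float", v1[1] + v2[1])
    return None


def eval_expr(a, env):
    """Big-step evaluation; a is ("var", id), ("cst", v), ("unop", op, a1),
    ("binop", op, a1, a2) or ("cond", a1, a2, a3)."""
    kind = a[0]
    if kind == "var":
        return env.get(a[1])
    if kind == "cst":
        return a[1]
    if kind == "unop":
        v = eval_expr(a[2], env)
        return None if v is None else eval_unop(a[1], v)
    if kind == "binop":
        v1, v2 = eval_expr(a[2], env), eval_expr(a[3], env)
        return None if v1 is None or v2 is None else eval_binop(a[1], v1, v2)
    if kind == "cond":
        v = eval_expr(a[1], env)
        if v is None:
            return None
        if istrue(v):
            return eval_expr(a[2], env)
        if isfalse(v):
            return eval_expr(a[3], env)
    return None
```

Note how adding two pointers, or adding anything to `("undef",)`, falls through to `None`: the expression has no value, so a statement evaluating it gets stuck.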

4.2.2 Execution of statements and functions

The labeled transition system that defines the small-step semantics for statements and 
function invocations follows the general pattern shown in sections 3.5 and 3.6. Program 
states have the following shape: 

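$$
\begin{array}{llll}
\text{Program states:} & S ::= & \mathcal{S}(F, s, k, \sigma, E, M) & \text{regular state} \\
 & \mid & \mathcal{C}(Fd, \vec{v}, k, M) & \text{call state} \\
 & \mid & \mathcal{R}(v, k, M) & \text{return state} \\
\text{Continuations:} & k ::= & \mathtt{stop} & \text{initial continuation} \\
 & \mid & s; k & \text{continue with $s$, then do as $k$} \\
 & \mid & \mathtt{endblock}(k) & \text{leave a block, then do as $k$} \\
 & \mid & \mathtt{returnto}(id^?, F, \sigma, E, k) & \text{return to caller}
\end{array}
$$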

Regular states $\mathcal{S}$ carry the currently-executing function $F$, the statement under
consideration $s$, the block identifier for the current stack data $\sigma$, and the values $E$ of local
variables.

Following a proposal by Appel and Blazy [3], we use continuation terms $k$ to encode
both the call stack and the program point within $F$ where the statement $s$ under
consideration resides. A continuation $k$ records what needs to be done once $s$ reduces
to skip, exit or return. The returnto parts of $k$ represent the call stack: they record
the local states of the calling functions. The top part of $k$ up to the first returnto
corresponds to an execution context for $s$, represented inside-out in the style of a
zipper [45]. For example, the continuation $s; \mathtt{endblock}(\ldots)$ corresponds to the context
$\mathtt{block}\ \{[\,]; s\}$.

Figures 7 and 8 list the rules defining the transition relation $G \vdash S \xrightarrow{t} S'$. The
rules in figure 7 address transitions within the currently-executing function. They are
roughly of three kinds:

– Execution of an atomic computation step. For example, the rule for assignments
  transitions from $id = a$ to skip.

– Focusing on the active part of the current statement. For example, the rule for
  sequences transitions from $(s_1; s_2)$ with continuation $k$ to $s_1$ with continuation
  $s_2; k$.

– Resuming a continuation that was set apart in the continuation. For instance,
  one of the rules for skip transitions from skip with continuation $s; k$ to $s$ with
  continuation $k$.

Two auxiliary functions over continuations are defined: $\mathtt{callcont}(k)$ discards the local
context part of the continuation $k$, and $\mathtt{findlabel}(lbl, s, k)$ returns a pair $(s', k')$ of the
leftmost sub-statement of $s$ labeled $lbl$ and of a continuation $k'$ that extends $k$ with the
context surrounding $s'$. The combination of these two functions in the rule for goto
suffices to implement the branching behavior of goto statements.
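The three kinds of rules (atomic step, focusing, resuming) can be seen at work in a very small continuation machine. The Python sketch below is a deliberately reduced illustration: it covers only skip, assignment, sequence, if, loop, block and exit; it drops the function, stack block and memory components of regular states; expressions are Python callables over the environment; and the tuple encodings, `step` and `run_to_end` are all hypothetical names introduced here. Labels, goto and `findlabel` are not modeled.

```python
def callcont(k):
    """Discard the local context: skip over ("seq", s, k) and ("endblock", k)."""
    while k[0] in ("seq", "endblock"):
        k = k[-1]
    return k


def step(s, k, env):
    """One transition; returns (s', k', env') or None when no rule applies."""
    tag = s[0]
    if tag == "skip":                       # resuming a continuation
        if k[0] == "seq":
            return k[1], k[2], env
        if k[0] == "endblock":
            return ("skip",), k[1], env
        return None
    if tag == "assign":                     # atomic computation step
        return ("skip",), k, {**env, s[1]: s[2](env)}
    if tag == "seq":                        # focusing on the active part
        return s[1], ("seq", s[2], k), env
    if tag == "if":
        return (s[2] if s[1](env) else s[3]), k, env
    if tag == "loop":
        return s[1], ("seq", s, k), env
    if tag == "block":
        return s[1], ("endblock", k), env
    if tag == "exit":
        if k[0] == "seq":                   # exit skips pending statements
            return s, k[2], env
        if k[0] == "endblock":
            nxt = ("skip",) if s[1] == 0 else ("exit", s[1] - 1)
            return nxt, k[1], env
    return None


def run_to_end(s, env, fuel=10_000):
    k = ("stop",)
    for _ in range(fuel):
        nxt = step(s, k, env)
        if nxt is None:
            return s, k, env
        s, k, env = nxt
    raise RuntimeError("out of fuel")


# s = 0; i = 0; block { loop { if (i >= 3) exit(0); s = s + i; i = i + 1 } }
body = ("seq", ("assign", "s", lambda e: 0),
        ("seq", ("assign", "i", lambda e: 0),
         ("block", ("loop",
                    ("seq", ("if", lambda e: e["i"] >= 3, ("exit", 0), ("skip",)),
                     ("seq", ("assign", "s", lambda e: e["s"] + e["i"]),
                      ("assign", "i", lambda e: e["i"] + 1)))))))
```

Running `run_to_end(body, {})` ends on skip with the `("stop",)` continuation: the `exit(0)` first discards the pending sequence continuations, then consumes the enclosing `endblock`, in the same way as the exit rules of the transition semantics.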

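Figure 8 lists the transitions involving call states and return states, and defines
initial states and final states. The definitions follow the general pattern depicted in
figure 3. In particular, initial states are call states to the "main" function of the program,
with no arguments and the stop continuation; symmetrically, final states are return
states with the stop continuation. The rules for function calls require that the signature
of the called function matches exactly the signature annotating the call; otherwise
execution gets stuck. A similar requirement exists in the C standard and is essential
to support signature-dependent calling conventions later in the compiler. Taking again
a leaf from the C standard, functions are allowed to terminate by return without an
argument or by falling through their bodies only if their return signatures are void.

$$G \vdash \mathcal{S}(F, \mathtt{skip}, (s; k), \sigma, E, M) \xrightarrow{\varepsilon} \mathcal{S}(F, s, k, \sigma, E, M)$$

$$G \vdash \mathcal{S}(F, \mathtt{skip}, \mathtt{endblock}(k), \sigma, E, M) \xrightarrow{\varepsilon} \mathcal{S}(F, \mathtt{skip}, k, \sigma, E, M)$$

$$\frac{G, \sigma, E, M \vdash a \Rightarrow v}{G \vdash \mathcal{S}(F, (id = a), k, \sigma, E, M) \xrightarrow{\varepsilon} \mathcal{S}(F, \mathtt{skip}, k, \sigma, E\{id \leftarrow v\}, M)}$$

$$\frac{G, \sigma, E, M \vdash a_1 \Rightarrow \mathtt{ptr}(b, \delta) \qquad G, \sigma, E, M \vdash a_2 \Rightarrow v \qquad \mathtt{store}(M, \kappa, b, \delta, v) = \lfloor M' \rfloor}{G \vdash \mathcal{S}(F, (\kappa[a_1] = a_2), k, \sigma, E, M) \xrightarrow{\varepsilon} \mathcal{S}(F, \mathtt{skip}, k, \sigma, E, M')}$$

$$G \vdash \mathcal{S}(F, (s_1; s_2), k, \sigma, E, M) \xrightarrow{\varepsilon} \mathcal{S}(F, s_1, (s_2; k), \sigma, E, M)$$

$$\frac{G, \sigma, E, M \vdash a \Rightarrow v \qquad \mathtt{istrue}(v)}{G \vdash \mathcal{S}(F, (\mathtt{if}(a)\{s_1\}\ \mathtt{else}\ \{s_2\}), k, \sigma, E, M) \xrightarrow{\varepsilon} \mathcal{S}(F, s_1, k, \sigma, E, M)}$$

$$\frac{G, \sigma, E, M \vdash a \Rightarrow v \qquad \mathtt{isfalse}(v)}{G \vdash \mathcal{S}(F, (\mathtt{if}(a)\{s_1\}\ \mathtt{else}\ \{s_2\}), k, \sigma, E, M) \xrightarrow{\varepsilon} \mathcal{S}(F, s_2, k, \sigma, E, M)}$$

$$G \vdash \mathcal{S}(F, \mathtt{loop}\{s\}, k, \sigma, E, M) \xrightarrow{\varepsilon} \mathcal{S}(F, s, (\mathtt{loop}\{s\}; k), \sigma, E, M)$$

$$G \vdash \mathcal{S}(F, \mathtt{block}\{s\}, k, \sigma, E, M) \xrightarrow{\varepsilon} \mathcal{S}(F, s, \mathtt{endblock}(k), \sigma, E, M)$$

$$G \vdash \mathcal{S}(F, \mathtt{exit}(n), (s; k), \sigma, E, M) \xrightarrow{\varepsilon} \mathcal{S}(F, \mathtt{exit}(n), k, \sigma, E, M)$$

$$G \vdash \mathcal{S}(F, \mathtt{exit}(0), \mathtt{endblock}(k), \sigma, E, M) \xrightarrow{\varepsilon} \mathcal{S}(F, \mathtt{skip}, k, \sigma, E, M)$$

$$G \vdash \mathcal{S}(F, \mathtt{exit}(n + 1), \mathtt{endblock}(k), \sigma, E, M) \xrightarrow{\varepsilon} \mathcal{S}(F, \mathtt{exit}(n), k, \sigma, E, M)$$

$$\frac{G, \sigma, E, M \vdash a \Rightarrow \mathtt{int}(n)}{G \vdash \mathcal{S}(F, (\mathtt{switch}(a)\{tbl\}), k, \sigma, E, M) \xrightarrow{\varepsilon} \mathcal{S}(F, \mathtt{exit}(tbl(n)), k, \sigma, E, M)}$$

$$G \vdash \mathcal{S}(F, (lbl : s), k, \sigma, E, M) \xrightarrow{\varepsilon} \mathcal{S}(F, s, k, \sigma, E, M)$$

$$\frac{\mathtt{findlabel}(lbl, F.\mathtt{body}, \mathtt{callcont}(k)) = \lfloor s', k' \rfloor}{G \vdash \mathcal{S}(F, \mathtt{goto}\ lbl, k, \sigma, E, M) \xrightarrow{\varepsilon} \mathcal{S}(F, s', k', \sigma, E, M)}$$

$$
\begin{array}{lll}
\mathtt{callcont}(s; k) & = & \mathtt{callcont}(k) \\
\mathtt{callcont}(\mathtt{endblock}(k)) & = & \mathtt{callcont}(k) \\
\mathtt{callcont}(k) & = & k \quad \text{otherwise}
\end{array}
$$

$$
\begin{array}{lll}
\mathtt{findlabel}(lbl, (s_1; s_2), k) & = & \begin{cases} \mathtt{findlabel}(lbl, s_1, (s_2; k)) & \text{if not } \emptyset; \\ \mathtt{findlabel}(lbl, s_2, k) & \text{otherwise} \end{cases} \\
\mathtt{findlabel}(lbl, \mathtt{if}(a)\{s_1\}\ \mathtt{else}\ \{s_2\}, k) & = & \begin{cases} \mathtt{findlabel}(lbl, s_1, k) & \text{if not } \emptyset; \\ \mathtt{findlabel}(lbl, s_2, k) & \text{otherwise} \end{cases} \\
\mathtt{findlabel}(lbl, \mathtt{loop}\{s\}, k) & = & \mathtt{findlabel}(lbl, s, (\mathtt{loop}\{s\}; k)) \\
\mathtt{findlabel}(lbl, \mathtt{block}\{s\}, k) & = & \mathtt{findlabel}(lbl, s, \mathtt{endblock}(k)) \\
\mathtt{findlabel}(lbl, (lbl : s), k) & = & \lfloor s, k \rfloor \\
\mathtt{findlabel}(lbl, (lbl' : s), k) & = & \mathtt{findlabel}(lbl, s, k) \quad \text{if } lbl' \neq lbl
\end{array}
$$

Fig. 7 Transition semantics for Cminor, part 1: statements.

$$\frac{G, \sigma, E, M \vdash a_1 \Rightarrow \mathtt{ptr}(b, 0) \qquad G, \sigma, E, M \vdash \vec{a} \Rightarrow \vec{v} \qquad \mathtt{funct}(G, b) = \lfloor Fd \rfloor \qquad Fd.\mathtt{sig} = sig}{G \vdash \mathcal{S}(F, (id^? = a_1(\vec{a}) : sig), k, \sigma, E, M) \xrightarrow{\varepsilon} \mathcal{C}(Fd, \vec{v}, \mathtt{returnto}(id^?, F, \sigma, E, k), M)}$$

$$\frac{G, \sigma, E, M \vdash a_1 \Rightarrow \mathtt{ptr}(b, 0) \qquad G, \sigma, E, M \vdash \vec{a} \Rightarrow \vec{v} \qquad \mathtt{funct}(G, b) = \lfloor Fd \rfloor \qquad Fd.\mathtt{sig} = sig}{G \vdash \mathcal{S}(F, (\mathtt{tailcall}\ a_1(\vec{a}) : sig), k, \sigma, E, M) \xrightarrow{\varepsilon} \mathcal{C}(Fd, \vec{v}, \mathtt{callcont}(k), M)}$$

$$\frac{F.\mathtt{sig.res} = \mathtt{void} \qquad k = \mathtt{returnto}(\ldots) \text{ or } k = \mathtt{stop}}{G \vdash \mathcal{S}(F, \mathtt{skip}, k, \sigma, E, M) \xrightarrow{\varepsilon} \mathcal{R}(\mathtt{undef}, k, \mathtt{free}(M, \sigma))}$$

$$\frac{F.\mathtt{sig.res} = \mathtt{void}}{G \vdash \mathcal{S}(F, \mathtt{return}, k, \sigma, E, M) \xrightarrow{\varepsilon} \mathcal{R}(\mathtt{undef}, \mathtt{callcont}(k), \mathtt{free}(M, \sigma))}$$

$$\frac{F.\mathtt{sig.res} \neq \mathtt{void} \qquad G, \sigma, E, M \vdash a \Rightarrow v}{G \vdash \mathcal{S}(F, \mathtt{return}(a), k, \sigma, E, M) \xrightarrow{\varepsilon} \mathcal{R}(v, \mathtt{callcont}(k), \mathtt{free}(M, \sigma))}$$

$$\frac{\mathtt{alloc}(M, 0, F.\mathtt{stackspace}) = (\sigma, M') \qquad E = [F.\mathtt{params} \leftarrow \vec{v};\ F.\mathtt{vars} \leftarrow \mathtt{undef}]}{G \vdash \mathcal{C}(\mathtt{internal}(F), \vec{v}, k, M) \xrightarrow{\varepsilon} \mathcal{S}(F, F.\mathtt{body}, k, \sigma, E, M')}$$

$$\frac{\vdash Fe(\vec{v}) \stackrel{t}{\Rightarrow} v \quad \text{(see section 3.4)}}{G \vdash \mathcal{C}(\mathtt{external}(Fe), \vec{v}, k, M) \xrightarrow{t} \mathcal{R}(v, k, M)}$$

$$G \vdash \mathcal{R}(v, \mathtt{returnto}(id^?, F, \sigma, E, k), M) \xrightarrow{\varepsilon} \mathcal{S}(F, \mathtt{skip}, k, \sigma, E\{id^? \leftarrow v\}, M)$$

$$\frac{\mathtt{symbol}(\mathtt{globalenv}(P), P.\mathtt{main}) = \lfloor b \rfloor \qquad \mathtt{funct}(\mathtt{globalenv}(P), b) = \lfloor Fd \rfloor}{\mathtt{initial}(P, \mathcal{C}(Fd, \varepsilon, \mathtt{stop}, \mathtt{initmem}(P)))}$$

$$\mathtt{final}(\mathcal{R}(\mathtt{int}(n), \mathtt{stop}, M), n)$$

Fig. 8 Transition semantics for Cminor, part 2: functions, initial states, final states.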

4.2.3 Alternate natural semantics for statements and functions

For some applications, it is convenient to have an alternate natural (big-step) opera- 
tional semantics for Cminor. We have developed such a semantics for the fragment of 
Cminor that excludes goto and labeled statements. The big-step judgments for termi- 
nating executions have the following form: 

\[
G, \sigma \vdash s, E, M \stackrel{t}{\Rightarrow} out, E', M' \quad \text{(statements)}
\]
\[
G \vdash Fd(\vec{v}), M \stackrel{t}{\Rightarrow} v, M' \quad \text{(function calls)}
\]

$E'$ and $M'$ are the local environment and the memory state at the end of the execution;
$t$ is the trace of events generated during execution. Following Huisman and Jacobs [46],
the outcome $out$ indicates how the statement $s$ terminated: either normally by running
to completion ($out = \mathtt{Normal}$); or prematurely by executing an exit statement
($out = \mathtt{Exit}(n)$), a return statement ($out = \mathtt{Return}(v)$ where $v$ is the value of the optional
argument to return), or a tailcall statement ($out = \mathtt{Tailreturn}(v)$). Additionally, we
followed the coinductive approach to natural semantics of Leroy and Grall [60] to define
(coinductively) big-step judgments for diverging executions, of the form

\[
G, \sigma \vdash s, E, M \stackrel{T}{\Rightarrow} \infty \quad \text{(diverging statements)}
\]
\[
G \vdash Fd(\vec{v}), M \stackrel{T}{\Rightarrow} \infty \quad \text{(diverging function calls)}
\]

where $T$ is the possibly infinite trace of events generated by the diverging execution.
The definitions of these judgments can be found in the Coq development. 
Theorem 5 The natural semantics of Cminor is correct with respect to its transition
semantics:

1. If $G \vdash Fd(\vec{v}), M \stackrel{t}{\Rightarrow} v, M'$, then $G \vdash \mathcal{C}(Fd, \vec{v}, k, M) \xrightarrow{t}{}^{*} \mathcal{R}(v, k, M')$ for all continuations
$k$ such that $k = \mathtt{callcont}(k)$.

2. If $G \vdash Fd(\vec{v}), M \stackrel{T}{\Rightarrow} \infty$, then $G \vdash \mathcal{C}(Fd, \vec{v}, k, M) \xrightarrow{T} \infty$ for all continuations $k$.



4.3 Static typing 

Cminor is equipped with a trivial type system having only two types: int and float. 
(Pointers have static type int.) Function definitions and function calls are annotated
with signatures sig giving the number and types of arguments, along with optional 
result types. All operators are monomorphic; therefore, the types of local variables can 
be inferred from their uses and are not declared. 

The primary purpose of this trivial type system is to facilitate later transformations 
(see sections 8 and 12); for this purpose, all intermediate languages of Compcert are 
equipped with similar int-or-float type systems. By themselves, these type systems
are too weak to give type soundness properties (absence of run-time type errors). 
For example, performing an integer addition of two pointers or two undef values is 
statically well-typed but causes the program to get stuck. Likewise, calling a function 
whose signature differs from that given at the call site is a run-time error, undetected 
by the type system; its semantics are not defined and the compiler can (and does)
generate incorrect code for this call. It is the responsibility of the Cminor producer to
avoid these situations, e.g. by using a richer type system. Nevertheless, the Cminor
type system enjoys a type preservation property: values of static type int are always
integers, pointers or undef, and values of static type float are always floating-point 
numbers or undef. This weak soundness property plays a role in the correctness proofs 
of section 12.3. 



5 Instruction selection 

The first compilation pass of Compcert rewrites expressions to exploit the combined 
arithmetic operations and addressing modes of the target processor. To take better 
advantage of the processor's capabilities, reassociation of integer additions and multi- 
plications is also performed, as well as a small amount of constant propagation. 



5.1 The target language: CminorSel 

The target language for this pass is CminorSel, a variant of Cminor that uses a dif- 
ferent, processor-specific set of operators. Additionally, a syntactic class of condition 
expressions ce (expressions used only for their truth values) is introduced. 
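\[
\begin{array}{llll}
\multicolumn{4}{l}{\text{Expressions:}} \\
\quad a & ::= & id & \text{reading a local variable} \\
 & \mid & op(\vec{a}) & \text{operator application} \\
 & \mid & \mathtt{load}(\kappa, mode, \vec{a}) & \text{memory read} \\
 & \mid & ce \mathbin{?} a_1 : a_2 & \text{conditional expression} \\
\multicolumn{4}{l}{\text{Condition expressions:}} \\
\quad ce & ::= & \mathtt{true} \mid \mathtt{false} & \\
 & \mid & cond(\vec{a}) & \text{elementary test} \\
 & \mid & ce_1 \mathbin{?} ce_2 : ce_3 & \text{conditional condition} \\
\multicolumn{4}{l}{\text{Operators (machine-specific):}} \\
\quad op & ::= & n \mid f \mid \mathtt{move} \mid \ldots & \text{most of Cminor operators} \\
 & \mid & \mathtt{addi}_n \mid \mathtt{rolm}_{n,m} \mid \ldots & \text{PPC combined operators} \\
\multicolumn{4}{l}{\text{Addressing modes (machine-specific):}} \\
\quad mode & ::= & \mathtt{indexed}(n) & \text{indexed, immediate displacement} \\
 & \mid & \mathtt{indexed2} & \text{indexed, register displacement} \\
 & \mid & \mathtt{global}(id, \delta) & \text{address is } id + \delta \\
 & \mid & \mathtt{based}(id, \delta) & \text{indexed, displacement is } id + \delta \\
 & \mid & \mathtt{stack}(\delta) & \text{address is stack pointer} + \delta \\
\multicolumn{4}{l}{\text{Conditions (machine-specific):}} \\
\quad cond & ::= & \mathtt{comp}(c) \mid \mathtt{compimm}(c, n) & \text{signed integer / pointer comparison} \\
 & \mid & \mathtt{compu}(c) \mid \mathtt{compuimm}(c, n) & \text{unsigned integer comparison} \\
 & \mid & \mathtt{compf}(c) & \text{float comparison} \\
\multicolumn{4}{l}{\text{Statements:}} \\
\quad s & ::= & \mathtt{store}(\kappa, mode, \vec{a}, a) & \text{memory write} \\
 & \mid & \mathtt{if}(ce)\ \{s_1\}\ \mathtt{else}\ \{s_2\} & \text{conditional statement} \\
 & \mid & \ldots & \text{as in Cminor}
\end{array}
\]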

For the PowerPC, the machine-specific operators op include all Cminor nullary,
unary and binary operators except notint, mod and modu (these need to be synthesized
from other operators) and add immediate forms of many integer operators, as
well as a number of combined operators such as not-or, not-and, and rotate-and-mask.
($\mathtt{rolm}_{n,m}$ is a left rotation by $n$ bits followed by a logical "and" with $m$.) A memory
load or store now carries an addressing mode $mode$ and a list of expressions $\vec{a}$,
from which the address being accessed is computed. Finally, conditional expressions
and conditional statements now take condition expressions ce as arguments instead of 
normal expressions a. 

The dynamic semantics of CminorSel resembles that of Cminor, with the addition of
a new evaluation judgment for condition expressions $G, \sigma, E, M \vdash ce \Rightarrow (\mathtt{false} \mid \mathtt{true})$.
Figure 9 shows the main differences with respect to the Cminor semantics. 



5.2 The code transformation 

Instruction selection is performed by a bottom-up rewriting of expressions. For each 
Cminor operator op, we define a "smart constructor" function written $\overline{op}$ that takes
CminorSel expressions as arguments, performs shallow pattern-matching over them to 
recognize combined or immediate operations, and returns the corresponding CminorSel 
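Evaluation of expressions:

\[
\frac{G, \sigma, E, M \vdash \vec{a} \Rightarrow \vec{v} \qquad \mathtt{eval\_op}(G, \sigma, op, \vec{v}) = \lfloor v \rfloor}
     {G, \sigma, E, M \vdash op(\vec{a}) \Rightarrow v}
\]

\[
\frac{G, \sigma, E, M \vdash \vec{a} \Rightarrow \vec{v} \qquad \mathtt{eval\_mode}(G, \sigma, mode, \vec{v}) = \lfloor \mathtt{ptr}(b, \delta) \rfloor \qquad \mathtt{load}(M, \kappa, b, \delta) = \lfloor v \rfloor}
     {G, \sigma, E, M \vdash \mathtt{load}(\kappa, mode, \vec{a}) \Rightarrow v}
\]

\[
\frac{G, \sigma, E, M \vdash ce \Rightarrow \mathtt{true} \qquad G, \sigma, E, M \vdash a_1 \Rightarrow v_1}
     {G, \sigma, E, M \vdash (ce \mathbin{?} a_1 : a_2) \Rightarrow v_1}
\qquad
\frac{G, \sigma, E, M \vdash ce \Rightarrow \mathtt{false} \qquad G, \sigma, E, M \vdash a_2 \Rightarrow v_2}
     {G, \sigma, E, M \vdash (ce \mathbin{?} a_1 : a_2) \Rightarrow v_2}
\]

Evaluation of condition expressions:

\[
G, \sigma, E, M \vdash \mathtt{true} \Rightarrow \mathtt{true}
\qquad
G, \sigma, E, M \vdash \mathtt{false} \Rightarrow \mathtt{false}
\]

\[
\frac{G, \sigma, E, M \vdash \vec{a} \Rightarrow \vec{v} \qquad \mathtt{eval\_cond}(cond, \vec{v}) = \lfloor b \rfloor}
     {G, \sigma, E, M \vdash cond(\vec{a}) \Rightarrow b}
\]

\[
\frac{G, \sigma, E, M \vdash ce_1 \Rightarrow \mathtt{true} \qquad G, \sigma, E, M \vdash ce_2 \Rightarrow b}
     {G, \sigma, E, M \vdash (ce_1 \mathbin{?} ce_2 : ce_3) \Rightarrow b}
\qquad
\frac{G, \sigma, E, M \vdash ce_1 \Rightarrow \mathtt{false} \qquad G, \sigma, E, M \vdash ce_3 \Rightarrow b}
     {G, \sigma, E, M \vdash (ce_1 \mathbin{?} ce_2 : ce_3) \Rightarrow b}
\]

Execution of statements:

\[
\frac{G, \sigma, E, M \vdash \vec{a} \Rightarrow \vec{v} \qquad \mathtt{eval\_mode}(G, \sigma, mode, \vec{v}) = \lfloor \mathtt{ptr}(b, \delta) \rfloor \qquad G, \sigma, E, M \vdash a \Rightarrow v \qquad \mathtt{store}(M, \kappa, b, \delta, v) = \lfloor M' \rfloor}
     {G \vdash \mathcal{S}(F, \mathtt{store}(\kappa, mode, \vec{a}, a), k, \sigma, E, M) \xrightarrow{\epsilon} \mathcal{S}(F, \mathtt{skip}, k, \sigma, E, M')}
\]

\[
\frac{G, \sigma, E, M \vdash ce \Rightarrow \mathtt{true}}
     {G \vdash \mathcal{S}(F, (\mathtt{if}(ce)\{s_1\}\ \mathtt{else}\ \{s_2\}), k, \sigma, E, M) \xrightarrow{\epsilon} \mathcal{S}(F, s_1, k, \sigma, E, M)}
\]

\[
\frac{G, \sigma, E, M \vdash ce \Rightarrow \mathtt{false}}
     {G \vdash \mathcal{S}(F, (\mathtt{if}(ce)\{s_1\}\ \mathtt{else}\ \{s_2\}), k, \sigma, E, M) \xrightarrow{\epsilon} \mathcal{S}(F, s_2, k, \sigma, E, M)}
\]

Fig. 9 Semantics of CminorSel. Only the rules that differ from those of Cminor are shown.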



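expression. For example, here are the smart constructor $\overline{\mathtt{add}}$ for integer addition and
its helper $\overline{\mathtt{addi}_n}$ for immediate integer addition (overlined names denote smart
constructors; plain names denote CminorSel operators):

\[
\begin{array}{rcll}
\overline{\mathtt{add}}(\mathtt{addi}_{n_1}(a_1), \mathtt{addi}_{n_2}(a_2)) & = & \mathtt{addi}_{n_1+n_2}(\mathtt{add}(a_1, a_2)) & \\
\overline{\mathtt{add}}(\mathtt{addi}_n(a_1), a_2) & = & \mathtt{addi}_n(\mathtt{add}(a_1, a_2)) & \\
\overline{\mathtt{add}}(a_1, \mathtt{addi}_n(a_2)) & = & \mathtt{addi}_n(\mathtt{add}(a_1, a_2)) & \\
\overline{\mathtt{add}}(n_1, n_2) & = & n_1 + n_2 & \\
\overline{\mathtt{add}}(a_1, a_2) & = & \mathtt{add}(a_1, a_2) & \text{otherwise} \\[1ex]
\overline{\mathtt{addi}_{n_1}}(n_2) & = & n_1 + n_2 & \\
\overline{\mathtt{addi}_{n_1}}(\mathtt{addi}_{n_2}(a)) & = & \mathtt{addi}_{n_1+n_2}(a) & \\
\overline{\mathtt{addi}_n}(\mathtt{addrsymbol}(id + \delta)) & = & \mathtt{addrsymbol}(id + (\delta + n)) & \\
\overline{\mathtt{addi}_n}(\mathtt{addrstack}(\delta)) & = & \mathtt{addrstack}(\delta + n) & \\
\overline{\mathtt{addi}_n}(a) & = & \mathtt{addi}_n(a) & \text{otherwise}
\end{array}
\]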
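The defining equations for the add and addi smart constructors can be read directly as a shallow pattern-matching function. The following Python sketch is only an illustration, not the Coq code of the development: the tuple encoding of expressions and the evaluator `eval_expr` are assumptions made for this example, and Python's unbounded integers stand in for 32-bit machine integers.

```python
# Expressions are nested tuples (an encoding chosen for this sketch only):
#   ("const", n)            integer constant
#   ("var", x)              local variable
#   ("add", a1, a2)         operator add
#   ("addi", n, a)          operator addi_n
#   ("addrsymbol", id, d)   address of symbol id plus offset d
#   ("addrstack", d)        stack pointer plus offset d

def addi(n, a):
    """Smart constructor for immediate addition addi_n(a)."""
    tag = a[0]
    if tag == "const":                       # addi_n1(n2) = n1 + n2
        return ("const", n + a[1])
    if tag == "addi":                        # addi_n1(addi_n2(a)) = addi_{n1+n2}(a)
        return ("addi", n + a[1], a[2])
    if tag == "addrsymbol":                  # fold n into the symbol offset
        return ("addrsymbol", a[1], a[2] + n)
    if tag == "addrstack":                   # fold n into the stack offset
        return ("addrstack", a[1] + n)
    return ("addi", n, a)                    # otherwise

def add(a1, a2):
    """Smart constructor for integer addition; cases tried in the order listed above."""
    if a1[0] == "addi" and a2[0] == "addi":
        return ("addi", a1[1] + a2[1], ("add", a1[2], a2[2]))
    if a1[0] == "addi":
        return ("addi", a1[1], ("add", a1[2], a2))
    if a2[0] == "addi":
        return ("addi", a2[1], ("add", a1, a2[2]))
    if a1[0] == "const" and a2[0] == "const":
        return ("const", a1[1] + a2[1])
    return ("add", a1, a2)                   # otherwise

def eval_expr(e, env):
    """Evaluate the integer fragment of the toy expression language."""
    tag = e[0]
    if tag == "const":
        return e[1]
    if tag == "var":
        return env[e[1]]
    if tag == "add":
        return eval_expr(e[1], env) + eval_expr(e[2], env)
    if tag == "addi":
        return e[1] + eval_expr(e[2], env)
    raise ValueError(f"cannot evaluate {tag}")

x, y = ("var", "x"), ("var", "y")
combined = add(addi(1, x), addi(2, y))       # (x + 1) + (y + 2)
```

In this sketch, `combined` is the single immediate addition `("addi", 3, ("add", x, y))`, and evaluating it in any environment gives the same integer as evaluating the unoptimized sum, which is the property that Lemma 2 establishes formally for the real constructors.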



Here are some cases from other smart constructors that illustrate reassociation of 
immediate multiplication and immediate addition, as well as the recognition of the 
rotate-and-mask instruction: 



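\[
\begin{array}{rcl}
\overline{\mathtt{muli}_m}(\mathtt{addi}_n(a)) & = & \mathtt{addi}_{m \times n}(\mathtt{muli}_m(a)) \\
\overline{\mathtt{shl}}(a, n) & = & \mathtt{rolm}_{n,\ (-1) \ll n}(a) \\
\overline{\mathtt{shru}}(a, n) & = & \mathtt{rolm}_{32-n,\ (-1) \gg n}(a) \\
\overline{\mathtt{and}}(a, n) & = & \mathtt{rolm}_{0,n}(a) \\
\overline{\mathtt{rolm}_{n_2,m_2}}(\mathtt{rolm}_{n_1,m_1}(a)) & = & \mathtt{rolm}_{n_1+n_2,\ m}(a) \quad \text{with } m = \mathtt{rol}(m_1, n_2) \wedge m_2 \\
\overline{\mathtt{or}}(\mathtt{rolm}_{n,m_1}(a), \mathtt{rolm}_{n,m_2}(a)) & = & \mathtt{rolm}_{n,\ m_1 \vee m_2}(a)
\end{array}
\]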



While innocuous-looking, these smart constructors are powerful enough to, for instance,
reduce $8 + (x + 1) \times 4$ to $x \times 4 + 12$, and to recognize a $\mathtt{rolm}_{3,-1}(x)$ instruction for
(x << 3) | (x >> 29), a C encoding of bit rotation commonly used in cryptography.
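To make the rotation example concrete, here is a small Python sketch of the shl, shru and or smart constructors. It is an illustration under assumptions, not the Coq implementation: expressions are encoded as tuples, shift amounts are constants in the range 0..31, and values are unsigned 32-bit words.

```python
MASK32 = 0xFFFFFFFF

def rol(x, n):
    """Rotate the 32-bit word x left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32

# ("rolm", n, m, a) stands for rolm_{n,m}(a): rotate a left by n, then "and" with m.

def shl(a, n):
    """shl(a, n) = rolm_{n, (-1) << n}(a)."""
    return ("rolm", n, (MASK32 << n) & MASK32, a)

def shru(a, n):
    """shru(a, n) = rolm_{32-n, (-1) >> n}(a), with an unsigned shift of the mask."""
    return ("rolm", (32 - n) % 32, MASK32 >> n, a)

def or_(a1, a2):
    """or(rolm_{n,m1}(a), rolm_{n,m2}(a)) = rolm_{n, m1 | m2}(a); plain 'or' otherwise."""
    if (a1[0] == "rolm" and a2[0] == "rolm"
            and a1[1] == a2[1] and a1[3] == a2[3]):
        return ("rolm", a1[1], a1[2] | a2[2], a1[3])
    return ("or", a1, a2)

def eval_expr(e, env):
    """Evaluate a toy expression over unsigned 32-bit words."""
    tag = e[0]
    if tag == "var":
        return env[e[1]]
    if tag == "rolm":
        return rol(eval_expr(e[3], env), e[1]) & e[2]
    if tag == "or":
        return eval_expr(e[1], env) | eval_expr(e[2], env)
    raise ValueError(f"cannot evaluate {tag}")

x = ("var", "x")
rotation = or_(shl(x, 3), shru(x, 29))   # the C idiom (x << 3) | (x >> 29)
```

Here `rotation` collapses to `("rolm", 3, 0xFFFFFFFF, x)`: the two masks `0xFFFFFFF8` and `0x00000007` are merged by the or case, and the all-ones mask is the 32-bit representation of -1.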

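The recognition of conditions and addressing modes is performed by two functions
$\mathtt{cond}(a) = ce$ and $\mathtt{mode}(a) = (mode, \vec{a})$. The translation of expressions is, then, a
straightforward bottom-up traversal, applying the appropriate smart constructors at
each step:

\[
\begin{array}{rcl}
[\![ cst ]\!] & = & cst \\
[\![ op_2(a_1, a_2) ]\!] & = & \overline{op_2}([\![ a_1 ]\!], [\![ a_2 ]\!]) \\
[\![ \kappa[a] ]\!] & = & \mathtt{load}(\kappa, mode, \vec{a}) \quad \text{where } (mode, \vec{a}) = \mathtt{mode}([\![ a ]\!]) \\
[\![ a_1 \mathbin{?} a_2 : a_3 ]\!] & = & \mathtt{cond}([\![ a_1 ]\!]) \mathbin{?} [\![ a_2 ]\!] : [\![ a_3 ]\!]
\end{array}
\]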
We omit the translation of statements and functions, which is similar. 



5.3 Semantic preservation 

The first part of the proof that instruction selection preserves semantics is to show the 
correctness of the smart constructor functions. 

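Lemma 2

1. If $G, \sigma, E, M \vdash a_1 \Rightarrow v_1$ and $\mathtt{eval\_unop}(op_1, v_1) = \lfloor v \rfloor$, then $G, \sigma, E, M \vdash \overline{op_1}(a_1) \Rightarrow v$.

2. If $G, \sigma, E, M \vdash a_1 \Rightarrow v_1$ and $G, \sigma, E, M \vdash a_2 \Rightarrow v_2$ and $\mathtt{eval\_binop}(op_2, v_1, v_2) = \lfloor v \rfloor$, then $G, \sigma, E, M \vdash \overline{op_2}(a_1, a_2) \Rightarrow v$.

3. If $G, \sigma, E, M \vdash a \Rightarrow v$ and $\mathtt{istrue}(v)$, then $G, \sigma, E, M \vdash \mathtt{cond}(a) \Rightarrow \mathtt{true}$.

4. If $G, \sigma, E, M \vdash a \Rightarrow v$ and $\mathtt{isfalse}(v)$, then $G, \sigma, E, M \vdash \mathtt{cond}(a) \Rightarrow \mathtt{false}$.

5. If $G, \sigma, E, M \vdash a \Rightarrow v$ and $\mathtt{mode}(a) = (mode, \vec{a})$, then there exists $\vec{v}$ such that
$G, \sigma, E, M \vdash \vec{a} \Rightarrow \vec{v}$ and $\mathtt{eval\_mode}(mode, \vec{v}) = \lfloor v \rfloor$.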

After copious case analysis on the operators and their arguments and inversion 
on the evaluations of the arguments, the proof reduces to showing that the defining 
equations for the smart constructors are valid when interpreted over ground machine
integers. For instance, in the case for rolm shown above, we have to prove that

\[
\mathtt{rol}(\mathtt{rol}(x, n_1) \wedge m_1,\ n_2) \wedge m_2 \;=\; \mathtt{rol}(x, n_1 + n_2) \wedge (\mathtt{rol}(m_1, n_2) \wedge m_2)
\]

which follows from the algebraic properties of rotate-left and bitwise "and". Completing
the proof of the lemma above required the development of a rather large and
difficult formalization of N-bit machine integers and of the algebraic properties of their
arithmetic and logical operations.
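Before attempting a formal proof of such an identity, it can be sanity-checked by random testing. The following Python sketch does so for the rotate-and-mask equation on 32-bit words; the helper names are hypothetical, and a passing run is evidence, not a proof.

```python
import random

MASK32 = 0xFFFFFFFF

def rol(x, n):
    """Rotate the 32-bit word x left by n bits (n is taken modulo 32)."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32

def rolm(x, n, m):
    """Rotate left by n bits, then mask with m."""
    return rol(x, n) & m

def check_rolm_composition(trials=10_000, seed=0):
    """Test rol(rol(x,n1) & m1, n2) & m2 == rol(x, n1+n2) & (rol(m1,n2) & m2)."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, m1, m2 = (rng.getrandbits(32) for _ in range(3))
        n1, n2 = rng.randrange(32), rng.randrange(32)
        lhs = rolm(rolm(x, n1, m1), n2, m2)
        rhs = rolm(x, n1 + n2, rol(m1, n2) & m2)
        if lhs != rhs:
            return False
    return True
```

The identity holds because rotation distributes over bitwise "and", so rotating the masked word by `n2` is the same as rotating both the word and the mask by `n2`.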

Let P be the original Cminor program and P' be the CminorSel program produced 
by instruction selection. Let G, G' be the corresponding global environments. Semantic 
preservation for the evaluation of expressions follows from lemma 2 by induction on 
the Cminor evaluation derivation. 

Lemma 3 If $G, \sigma, E, M \vdash a \Rightarrow v$, then $G', \sigma, E, M \vdash [\![ a ]\!] \Rightarrow v$.

The last part of the semantic preservation proof is a simulation argument of the 
form outlined in section 3.7. Since the structure of statements is preserved by the 
translation, transitions in the original and transformed programs match one-to-one, 
resulting in a "lock-step" simulation diagram. The relation ~ between Cminor and 
CminorSel execution states is defined as follows: 

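\[
\begin{array}{rcl}
\mathcal{S}(F, s, k, \sigma, E, M) & \sim & \mathcal{S}([\![ F ]\!], [\![ s ]\!], [\![ k ]\!], \sigma, E, M) \\
\mathcal{C}(Fd, \vec{v}, k, M) & \sim & \mathcal{C}([\![ Fd ]\!], \vec{v}, [\![ k ]\!], M) \\
\mathcal{R}(v, k, M) & \sim & \mathcal{R}(v, [\![ k ]\!], M)
\end{array}
\]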

Since the transformed code computes exactly the same values as the original, environments
E and memory states M are identical in matching states. Statements and
functions appearing in states must be the translation of one another. For continuations, 
we extend (isomorphically) the translation of statements and functions. 

Lemma 4 If $G \vdash S_1 \xrightarrow{t} S_2$ and $S_1 \sim S_1'$, there exists $S_2'$ such that $G' \vdash S_1' \xrightarrow{t} S_2'$ and
$S_2 \sim S_2'$.

The proof is a straightforward case analysis on the transition from $S_1$ to $S_2$. Semantic
preservation for instruction selection then follows from theorem 3 and lemma 4.

6 RTL generation 

The second compilation pass translates CminorSel to a simple intermediate language 
of the RTL kind, with control represented as a control-flow graph instead of struc- 
tured statements. This intermediate language is convenient for performing optimiza- 
tions later. 

6.1 The target language: RTL 

The RTL language represents functions as a control-flow graph (CFG) of abstract
instructions, corresponding roughly to machine instructions but operating over pseudo-registers
(also called "temporaries"). Every function has an unlimited supply of pseudo-registers,
and their values are preserved across function calls. In the following, $r$ ranges
over pseudo-registers and $l$ over labels of CFG nodes.
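\[
\begin{array}{llll}
\text{RTL instructions:} \quad i & ::= & \mathtt{nop}(l) & \text{no operation (go to } l\text{)} \\
 & \mid & \mathtt{op}(op, \vec{r}, r, l) & \text{arithmetic operation} \\
 & \mid & \mathtt{load}(\kappa, mode, \vec{r}, r, l) & \text{memory load} \\
 & \mid & \mathtt{store}(\kappa, mode, \vec{r}, r, l) & \text{memory store} \\
 & \mid & \mathtt{call}(sig, (r \mid id), \vec{r}, r, l) & \text{function call} \\
 & \mid & \mathtt{tailcall}(sig, (r \mid id), \vec{r}) & \text{function tail call} \\
 & \mid & \mathtt{cond}(cond, \vec{r}, l_{true}, l_{false}) & \text{conditional branch} \\
 & \mid & \mathtt{return} \mid \mathtt{return}(r) & \text{function return} \\[1ex]
\text{RTL control-flow graph:} \quad g & ::= & l \mapsto i & \text{finite map} \\[1ex]
\text{RTL functions:} \quad F & ::= & \{\ \mathtt{sig} = sig; & \\
 & & \ \ \mathtt{params} = \vec{r}; & \text{parameters} \\
 & & \ \ \mathtt{stacksize} = n; & \text{size of stack data block} \\
 & & \ \ \mathtt{entrypoint} = l; & \text{label of first instruction} \\
 & & \ \ \mathtt{code} = g\ \} & \text{control-flow graph}
\end{array}
\]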

Each instruction takes its arguments in a list of pseudo-registers $\vec{r}$ and stores its result,
if any, in a pseudo-register $r$. Additionally, it carries the labels of its possible successors.
We use instructions rather than basic blocks as nodes of the control-flow graph because
this simplifies semantics and reasoning over static analyses without significantly slowing
compilation [50].

The dynamic semantics of RTL is defined by the labeled transition system shown
in figure 10. Program states have the following form:

\[
\begin{array}{llll}
\text{Program states:} \quad S & ::= & \mathcal{S}(\Sigma, g, \sigma, l, R, M) & \text{regular state} \\
 & \mid & \mathcal{C}(\Sigma, Fd, \vec{v}, M) & \text{call state} \\
 & \mid & \mathcal{R}(\Sigma, v, M) & \text{return state} \\[1ex]
\text{Call stacks:} \quad \Sigma & ::= & (\mathcal{F}(r, F, \sigma, l, R))^{*} & \text{list of frames} \\[1ex]
\text{Register states:} \quad R & ::= & r \mapsto v &
\end{array}
\]

In regular states, $g$ is the CFG of the function currently executing, $l$ a program point
(CFG node label) within this function, $\sigma$ its stack data block, and $R$ an assignment of
values for the pseudo-registers of $F$. All three states carry a call stack $\Sigma$, which is a
list of frames $\mathcal{F}$ representing pending function calls and containing the corresponding
per-function state $F, \sigma, l, R$.

The transition system in figure 10 is unsurprising. Transitions from a regular state
discriminate on the instruction found at the current program point. To interpret arithmetic
operations, conditions and addressing modes, we reuse the functions eval_op,
eval_cond and eval_mode of the CminorSel semantics. Other transitions follow the
pattern described in section 3.6.
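To illustrate how transitions discriminate on the instruction found at the current program point, here is a minimal Python interpreter for a fragment of RTL. It is a sketch under several assumptions: only nop, op, cond and return(r) are modeled (no memory, call stack or traces), instructions are tuples, the operator and condition names are a small hypothetical subset, and values are unsigned 32-bit integers.

```python
import operator

M32 = 0xFFFFFFFF
CMP = {"eq": operator.eq, "ne": operator.ne, "lt": operator.lt,
       "le": operator.le, "gt": operator.gt, "ge": operator.ge}

def eval_op(op, vals):
    """Interpret an operator tuple over a list of argument values."""
    kind = op[0]
    if kind == "intconst":
        return op[1] & M32
    if kind == "move":
        return vals[0]
    if kind == "add":
        return (vals[0] + vals[1]) & M32
    if kind == "addi":
        return (vals[0] + op[1]) & M32
    raise ValueError(f"unknown operator {kind}")

def eval_cond(cond, vals):
    """Interpret an unsigned comparison condition."""
    kind, c = cond[0], cond[1]
    if kind == "compu":
        return CMP[c](vals[0], vals[1])
    if kind == "compuimm":
        return CMP[c](vals[0], cond[2] & M32)
    raise ValueError(f"unknown condition {kind}")

def step(g, l, R):
    """One transition from the regular state (g, l, R)."""
    instr = g[l]
    kind = instr[0]
    if kind == "nop":
        return ("regular", instr[1], R)
    if kind == "op":
        _, op, args, dest, succ = instr
        v = eval_op(op, [R[r] for r in args])
        return ("regular", succ, {**R, dest: v})      # R{dest <- v}
    if kind == "cond":
        _, cond, args, l_true, l_false = instr
        taken = eval_cond(cond, [R[r] for r in args])
        return ("regular", l_true if taken else l_false, R)
    if kind == "return":
        return ("return", R[instr[1]])
    raise ValueError(f"unknown instruction {kind}")

def run(g, entry, R, fuel=10_000):
    """Iterate step until a return state; None if the fuel runs out."""
    l = entry
    for _ in range(fuel):
        state = step(g, l, R)
        if state[0] == "return":
            return state[1]
        _, l, R = state
    return None

# CFG computing n + (n-1) + ... + 1 into register "acc".
SUM_CFG = {
    1: ("op", ("intconst", 0), [], "acc", 2),
    2: ("cond", ("compuimm", "eq", 0), ["n"], 5, 3),
    3: ("op", ("add",), ["acc", "n"], "acc", 4),
    4: ("op", ("addi", -1), ["n"], "n", 2),
    5: ("return", "acc"),
}
total = run(SUM_CFG, 1, {"n": 4})
```

Each instruction carries the labels of its successors, so the interpreter needs no notion of "next instruction": control flow is entirely determined by the CFG.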



6.2 Relational specification of the translation 

The translation from CminorSel to RTL is conceptually simple: the structured control
is encoded as a CFG¹; expressions are decomposed into sequences of RTL instructions;

¹ Since RTL currently has no instructions performing N-way branches (i.e. jump tables),
this translation of control includes the generation of binary decision trees for Cminor switch
statements. We do not describe this part of the translation in this article.
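\[
\frac{g(l) = \lfloor \mathtt{nop}(l') \rfloor}
     {G \vdash \mathcal{S}(\Sigma, g, \sigma, l, R, M) \xrightarrow{\epsilon} \mathcal{S}(\Sigma, g, \sigma, l', R, M)}
\]

\[
\frac{g(l) = \lfloor \mathtt{op}(op, \vec{r}, r, l') \rfloor \qquad \mathtt{eval\_op}(G, \sigma, op, R(\vec{r})) = \lfloor v \rfloor}
     {G \vdash \mathcal{S}(\Sigma, g, \sigma, l, R, M) \xrightarrow{\epsilon} \mathcal{S}(\Sigma, g, \sigma, l', R\{r \leftarrow v\}, M)}
\]

\[
\frac{g(l) = \lfloor \mathtt{load}(\kappa, mode, \vec{r}, r, l') \rfloor \qquad \mathtt{eval\_mode}(G, \sigma, mode, R(\vec{r})) = \lfloor \mathtt{ptr}(b, \delta) \rfloor \qquad \mathtt{load}(M, \kappa, b, \delta) = \lfloor v \rfloor}
     {G \vdash \mathcal{S}(\Sigma, g, \sigma, l, R, M) \xrightarrow{\epsilon} \mathcal{S}(\Sigma, g, \sigma, l', R\{r \leftarrow v\}, M)}
\]

\[
\frac{g(l) = \lfloor \mathtt{store}(\kappa, mode, \vec{r}, r, l') \rfloor \qquad \mathtt{eval\_mode}(G, \sigma, mode, R(\vec{r})) = \lfloor \mathtt{ptr}(b, \delta) \rfloor \qquad \mathtt{store}(M, \kappa, b, \delta, R(r)) = \lfloor M' \rfloor}
     {G \vdash \mathcal{S}(\Sigma, g, \sigma, l, R, M) \xrightarrow{\epsilon} \mathcal{S}(\Sigma, g, \sigma, l', R, M')}
\]

\[
\frac{g(l) = \lfloor \mathtt{call}(sig, r_f, \vec{r}, r, l') \rfloor \qquad R(r_f) = \mathtt{ptr}(b, 0) \qquad \mathtt{funct}(G, b) = \lfloor Fd \rfloor \qquad Fd.\mathtt{sig} = sig}
     {G \vdash \mathcal{S}(\Sigma, g, \sigma, l, R, M) \xrightarrow{\epsilon} \mathcal{C}(\mathcal{F}(r, g, \sigma, l', R).\Sigma, Fd, R(\vec{r}), M)}
\]

\[
\frac{g(l) = \lfloor \mathtt{tailcall}(sig, r_f, \vec{r}) \rfloor \qquad R(r_f) = \mathtt{ptr}(b, 0) \qquad \mathtt{funct}(G, b) = \lfloor Fd \rfloor \qquad Fd.\mathtt{sig} = sig}
     {G \vdash \mathcal{S}(\Sigma, g, \sigma, l, R, M) \xrightarrow{\epsilon} \mathcal{C}(\Sigma, Fd, R(\vec{r}), \mathtt{free}(M, \sigma))}
\]

\[
\frac{g(l) = \lfloor \mathtt{cond}(cond, \vec{r}, l_{true}, l_{false}) \rfloor \qquad \mathtt{eval\_cond}(cond, R(\vec{r})) = \lfloor \mathtt{true} \rfloor}
     {G \vdash \mathcal{S}(\Sigma, g, \sigma, l, R, M) \xrightarrow{\epsilon} \mathcal{S}(\Sigma, g, \sigma, l_{true}, R, M)}
\]

\[
\frac{g(l) = \lfloor \mathtt{cond}(cond, \vec{r}, l_{true}, l_{false}) \rfloor \qquad \mathtt{eval\_cond}(cond, R(\vec{r})) = \lfloor \mathtt{false} \rfloor}
     {G \vdash \mathcal{S}(\Sigma, g, \sigma, l, R, M) \xrightarrow{\epsilon} \mathcal{S}(\Sigma, g, \sigma, l_{false}, R, M)}
\]

\[
\frac{g(l) = \lfloor \mathtt{return} \rfloor}
     {G \vdash \mathcal{S}(\Sigma, g, \sigma, l, R, M) \xrightarrow{\epsilon} \mathcal{R}(\Sigma, \mathtt{undef}, \mathtt{free}(M, \sigma))}
\]

\[
\frac{g(l) = \lfloor \mathtt{return}(r) \rfloor}
     {G \vdash \mathcal{S}(\Sigma, g, \sigma, l, R, M) \xrightarrow{\epsilon} \mathcal{R}(\Sigma, R(r), \mathtt{free}(M, \sigma))}
\]

\[
\frac{\mathtt{alloc}(M, 0, F.\mathtt{stacksize}) = (\sigma, M')}
     {G \vdash \mathcal{C}(\Sigma, \mathtt{internal}(F), \vec{v}, M) \xrightarrow{\epsilon} \mathcal{S}(\Sigma, F.\mathtt{code}, \sigma, F.\mathtt{entrypoint}, [F.\mathtt{params} \leftarrow \vec{v}], M')}
\]

\[
\frac{\vdash Fe(\vec{v}) \stackrel{t}{\Rightarrow} v \quad \text{(see section 3.4)}}
     {G \vdash \mathcal{C}(\Sigma, \mathtt{external}(Fe), \vec{v}, M) \xrightarrow{t} \mathcal{R}(\Sigma, v, M)}
\]

\[
G \vdash \mathcal{R}(\mathcal{F}(r, g, \sigma, l, R).\Sigma, v, M) \xrightarrow{\epsilon} \mathcal{S}(\Sigma, g, \sigma, l, R\{r \leftarrow v\}, M)
\]

\[
\frac{\mathtt{symbol}(\mathtt{globalenv}(P), P.\mathtt{main}) = \lfloor b \rfloor \qquad \mathtt{funct}(\mathtt{globalenv}(P), b) = \lfloor Fd \rfloor}
     {\mathtt{initial}(P, \mathcal{C}(\epsilon, Fd, \epsilon, \mathtt{initmem}(P)))}
\]

\[
\mathtt{final}(\mathcal{R}(\epsilon, \mathtt{int}(n), M), n)
\]

Fig. 10 Semantics of RTL.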



pseudo-registers are generated to hold the values of CminorSel variables and intermedi- 
ate results of expression evaluations. The decomposition of expressions is made trivial 
by the prior conversion to CminorSel: every operation becomes exactly one op instruc- 
tion. However, implementing this translation in Coq is delicate: since Coq is a pure 
functional language, we cannot use imperative updates to build the CFG and gen- 
erate fresh pseudo-registers and CFG nodes. Section 6.4 describes a solution, based 
(unsurprisingly) on the use of a monad. 
Fig. 11 Examples of correspondences between expressions/statements and sub-graphs of a
CFG.



In the present section, we give a relational, non-executable specification of the 
translation: syntactic conditions under which a RTL function is an acceptable transla- 
tion of a CminorSel function. The key intuition captured by this specification is that 
each subexpression or substatement of the CminorSel function should correspond to a 
sub-graph of the RTL control-flow graph. For an expression $a$, this sub-graph is identified
by a start node $l_1$ and an end node $l_2$. The instructions on the paths from $l_1$ to
$l_2$ should compute the value of expression $a$, deposit it in a given destination register
$r_d$, and preserve the values of a given set of registers. For a statement $s$, the sub-graph
has one start node but several end nodes corresponding to the multiple ways in
which a statement can terminate (normally, by exit, or by goto). Figure 11 illustrates 
this correspondence. As depicted there, the sub-graph for a compound expression or 
compound statement such as si; S2 contains sub-graphs for its components si and S2, 
suitably connected. In other words, the relational specification of the translation de- 
scribes a hierarchical decomposition of the CFG along the structure of statements and 
expressions of the original CminorSel code. 

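The specification for expressions is the predicate $g, \gamma, \pi \vdash a \text{ in } r_d \sim l_1, l_2$, where $g$
is the CFG, $\gamma$ an injective mapping from CminorSel variables to the registers holding
their values, $\pi$ a set of registers that the instructions evaluating $a$ must preserve (in
addition to those in $\mathrm{Rng}(\gamma)$), $a$ the CminorSel expression, $r_d$ the register where its value
must be deposited, and $l_1$, $l_2$ the start and end nodes in the CFG. The following rules
give the flavor of the specification:

\[
\frac{\gamma(id) = \lfloor r_d \rfloor}
     {g, \gamma, \pi \vdash id \text{ in } r_d \sim l_1, l_1}
\qquad
\frac{\gamma(id) = \lfloor r \rfloor \qquad g(l_1) = \lfloor \mathtt{op}(\mathtt{move}, r, r_d, l_2) \rfloor \qquad r_d \notin \mathrm{Rng}(\gamma) \cup \pi}
     {g, \gamma, \pi \vdash id \text{ in } r_d \sim l_1, l_2}
\]

\[
\frac{g, \gamma, \pi \vdash \vec{a} \text{ in } \vec{r} \sim l_1, l \qquad g(l) = \lfloor \mathtt{op}(op, \vec{r}, r_d, l_2) \rfloor \qquad r_d \notin \mathrm{Rng}(\gamma) \cup \pi}
     {g, \gamma, \pi \vdash op(\vec{a}) \text{ in } r_d \sim l_1, l_2}
\]

\[
g, \gamma, \pi \vdash \epsilon \text{ in } \epsilon \sim l_1, l_1
\]

\[
\frac{g, \gamma, \pi \vdash a \text{ in } r \sim l_1, l \qquad g, \gamma, \pi \cup \{r\} \vdash \vec{a} \text{ in } \vec{r} \sim l, l_2 \qquad r \notin \mathrm{Rng}(\gamma) \cup \pi}
     {g, \gamma, \pi \vdash a.\vec{a} \text{ in } r.\vec{r} \sim l_1, l_2}
\]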

The freshness side-conditions $r \notin \mathrm{Rng}(\gamma) \cup \pi$ ensure that the temporary registers used
to hold the values of subexpressions of $a$ do not interfere with registers holding values of
CminorSel variables ($\mathrm{Rng}(\gamma)$) nor with temporary registers holding values of previously
computed subexpressions (e.g. in the expression $\mathtt{add}(a_1, a_2)$, the value of $a_1$ during
the computation of the value of $a_2$). The specification for conditional expressions is
similar: $g, \gamma, \pi \vdash c \sim l_1, l_{true}, l_{false}$, but with two exit nodes, $l_{true}$ and $l_{false}$. The
instruction sequence starting at $l_1$ should terminate at $l_{true}$ if $c$ evaluates to true and
at $l_{false}$ if $c$ evaluates to false. Finally, the translation of statements is specified by
the predicate $g, \gamma \vdash s \sim l_1, l_2, l_e, l_g, l_r, r_r$, where $l_e$ is a list of nodes and $l_g$ a mapping
from CminorSel labels to nodes. The contract expressed by this complicated predicate
is that the instruction sequence starting at $l_1$ should compute whatever $s$ computes
and branch to node $l_2$ if $s$ terminates normally, to $l_e(n)$ if $s$ terminates by exit($n$),
to $l_g(lbl)$ if $s$ performs goto $lbl$, and to $l_r$ if $s$ performs a return (after depositing the
return value, if any, in register $r_r$). For simplicity, figure 11 depicts only the $l_2$ final
node, and the following sample rules consider only the $l_2$ and $l_e$ final nodes.

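\[
\frac{\gamma(id) = \lfloor r_d \rfloor \qquad g, \gamma, \emptyset \vdash a \text{ in } r_d \sim l_1, l_2}
     {g, \gamma \vdash (id = a) \sim l_1, l_2, l_e}
\]

\[
\frac{\gamma(id) = \lfloor r \rfloor \qquad g, \gamma, \emptyset \vdash a \text{ in } r_d \sim l_1, l \qquad g(l) = \lfloor \mathtt{op}(\mathtt{move}, r_d, r, l_2) \rfloor}
     {g, \gamma \vdash (id = a) \sim l_1, l_2, l_e}
\]

\[
\frac{g, \gamma \vdash s_1 \sim l_1, l, l_e \qquad g, \gamma \vdash s_2 \sim l, l_2, l_e}
     {g, \gamma \vdash (s_1; s_2) \sim l_1, l_2, l_e}
\]

\[
\frac{g, \gamma, \emptyset \vdash c \sim l_1, l_{true}, l_{false} \qquad g, \gamma \vdash s_1 \sim l_{true}, l_2, l_e \qquad g, \gamma \vdash s_2 \sim l_{false}, l_2, l_e}
     {g, \gamma \vdash \mathtt{if}(c)\{s_1\}\ \mathtt{else}\ \{s_2\} \sim l_1, l_2, l_e}
\]

\[
\frac{g(l_1) = \lfloor \mathtt{nop}(l) \rfloor \qquad g, \gamma \vdash s \sim l, l_1, l_e}
     {g, \gamma \vdash \mathtt{loop}\{s\} \sim l_1, l_2, l_e}
\qquad
\frac{g, \gamma \vdash s \sim l_1, l_2, l_2.l_e}
     {g, \gamma \vdash \mathtt{block}\{s\} \sim l_1, l_2, l_e}
\]

\[
\frac{l_e(n) = \lfloor l_1 \rfloor}
     {g, \gamma \vdash \mathtt{exit}(n) \sim l_1, l_2, l_e}
\]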

The specification for the translation of CminorSel functions to RTL functions is, then:
if $[\![ F ]\!] = \lfloor F' \rfloor$, it must be the case that

\[
F'.\mathtt{code}, \gamma \vdash F.\mathtt{body} \sim F'.\mathtt{entrypoint}, l, \epsilon, l_g, l, r_r
\]

for some $l$, $l_g$ and injective $\gamma$; moreover, the CFG node at $l$ must contain a $\mathtt{return}(r_r)$
instruction appropriate to the signature of $F$.
6.3 Semantic preservation

We now outline the proof of semantic preservation for RTL generation. Consider a CminorSel
program $P$ and an RTL program $P'$, with $G$, $G'$ being the corresponding global
environments. Assuming that $P'$ is an acceptable translation of $P$ according to the
relational specification of section 6.2, we show that executions of $P$ are simulated by
executions of $P'$. For the evaluation of CminorSel expressions, the simulation argument
is of the form "if expression $a$ evaluates to value $v$, the generated RTL code performs
a sequence of transitions from $l_1$ to $l_2$, where $l_1, l_2$ delimit the sub-graph of the CFG
corresponding to $a$, and leaves $v$ in the given destination register $r_d$". Agreement between
a CminorSel environment $E$ and an RTL register state $R$ is written $\gamma \vdash E \sim R$
and defined as $R(\gamma(x)) = E(x)$ for all $x \in \mathrm{Dom}(\gamma)$.

Lemma 5 Assume $F.\mathtt{code}, \gamma, \pi \vdash a \text{ in } r_d \sim l_1, l_2$ and $\gamma$ is injective.
If $G, \sigma, E, M \vdash a \Rightarrow v$ and $\gamma \vdash E \sim R$, there exists $R'$ such that

1. The RTL code executes from $l_1$ to $l_2$:

\[
G' \vdash \mathcal{S}(\Sigma, F, \sigma, l_1, R, M) \xrightarrow{\epsilon}{}^{*} \mathcal{S}(\Sigma, F, \sigma, l_2, R', M)
\]

2. Register $r_d$ contains the value $v$ at the end of this execution: $R'(r_d) = v$.

3. The values of preserved registers are unchanged: $R'(r) = R(r)$ for all $r \in \mathrm{Rng}(\gamma) \cup \pi$.
This implies $\gamma \vdash E \sim R'$ in particular.

This lemma, along with a similar lemma for condition expressions, is proved by induc- 
tion on the CminorSel evaluation derivation. To relate CminorSel and RTL execution 
states, we need to define a correspondence between CminorSel continuations and RTL 
call stacks. A continuation k interleaves two aspects that are handled separately in 
RTL. The parts of k that lie between returnto markers, namely the continuations s; k' 
and $\mathtt{endblock}(k')$, correspond to execution paths within the current function. These
paths connect the several possible end points for the current statement with the final
return of the function. Just as a statement $s$ is associated with a "fan-out" subgraph
of the CFG (one start point, several end points), this part of $k$ is associated with a
"fan-in" subgraph of the CFG (several start points, one end point corresponding to a
return instruction). The other parts of $k$, namely the returnto markers, are in one-to-one
correspondence with frames on the RTL call stack $\Sigma$. We formalize these intuitions
using two mutually inductive predicates: $g, \gamma \vdash k \sim l_2, l_e, l_g, l_r, r_r, \Sigma$ for the local part
of the continuation $k$, and $k \sim \Sigma$ for call continuations $k$. The definitions are omitted
for brevity. The invariant between CminorSel states and RTL states is, then, of the
following form.

\[
\frac{g, \gamma \vdash s \sim l_1, l_2, l_e, l_g, l_r, r_r \qquad g, \gamma \vdash k \sim l_2, l_e, l_g, l_r, r_r, \Sigma \qquad \gamma \vdash E \sim R}
     {\mathcal{S}(F, s, k, \sigma, E, M) \sim \mathcal{S}(\Sigma, g, \sigma, l_1, R, M)}
\]

\[
\frac{[\![ Fd ]\!] = \lfloor Fd' \rfloor \qquad k \sim \Sigma}
     {\mathcal{C}(Fd, \vec{v}, k, M) \sim \mathcal{C}(\Sigma, Fd', \vec{v}, M)}
\qquad
\frac{k \sim \Sigma}
     {\mathcal{R}(v, k, M) \sim \mathcal{R}(\Sigma, v, M)}
\]

The proof of semantic preservation for statements and functions is a simulation di- 
agram of the "star" kind (see section 3.7). Several CminorSel transitions become no- 
operations in the translated RTL code, such as self assignments id = id and exit(n) 
constructs. We therefore need to define a measure over CminorSel states that decreases
on such potentially stuttering steps. After trial and error, an appropriate measure
for regular states is the lexicographically-ordered pair of nonnegative integers
$|\mathcal{S}(F, s, k, \sigma, E, M)| = (|s| + |k|, |s|)$ where $|s|$ is the number of nodes in the abstract
syntax tree for $s$, and

\[
|s; k| = 1 + |s| + |k| \qquad |\mathtt{endblock}(k)| = 1 + |k| \qquad |k| = 0 \text{ otherwise.}
\]

The measure for call states and return states is $(0, 0)$.

Lemma 6 If $G \vdash S_1 \xrightarrow{t} S_2$ and $S_1 \sim S_1'$, either there exists $S_2'$ such that $G' \vdash S_1' \xrightarrow{t}{}^{+} S_2'$
and $S_2 \sim S_2'$, or $|S_2| < |S_1|$ and there exists $S_2'$ such that $G' \vdash S_1' \xrightarrow{t}{}^{*} S_2'$ and
$S_2 \sim S_2'$.

The proof is a long case analysis on the transition $G \vdash S_1 \xrightarrow{t} S_2$. Semantic preservation
for RTL generation then follows from theorem 4.

6.4 Functional implementation of the translation 

We now return to the question left open at the beginning of section 6.2: how to define
the generation of RTL as a Coq function? Naturally, the translation proceeds by a
recursive traversal of CminorSel expressions and statements, incrementally adding the
corresponding instructions to the CFG and generating fresh temporary registers to
hold intermediate results within expressions. Additionally, the translation may fail,
e.g. if an undeclared local variable is referenced or a statement label is defined several
times. This would cause no programming difficulties in a language featuring mutation
and exceptions, but these luxuries are not available in Coq, which is a pure functional
language. We therefore use a monadic programming style using the state-and-error
monad. The compile-time state is a triple $(g, l, r)$ where $g$ is the current state of the
CFG, $l$ the next unused CFG node label, and $r$ the next unused pseudo-register. Every
translation that computes (imperatively) a result of type $\alpha$ becomes a pure function
with type $\mathtt{mon}(\alpha) = \mathtt{state} \rightarrow \mathtt{Error} \mid \mathtt{OK}(\mathtt{state} \times \alpha)$. Besides the familiar ret and bind
monadic combinators, the basic operations of this monad are:

– newreg : mon(reg) generates a fresh temporary register (by incrementing the r
component of the state);

– add_instr(i) : mon(node) allocates a fresh CFG node l, stores the instruction i in
this node, and returns l;

– reserve_instr : mon(node) allocates and returns a fresh CFG node, leaving it
empty;

– update_instr(l, i) : mon(unit) stores instruction i in node l, raising an error if node
l is not empty.

(The latter two operations are used when compiling loops and labeled statements.) 
The translation functions, then, are of the following form: 

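\[
\begin{array}{l}
\mathtt{transl\_expr}(\gamma, a, r_d, l_2) : \mathtt{mon}(\mathtt{node}) \\
\mathtt{transl\_exprlist}(\gamma, \vec{a}, \vec{r_d}, l_2) : \mathtt{mon}(\mathtt{node}) \\
\mathtt{transl\_condition}(\gamma, c, l_{true}, l_{false}) : \mathtt{mon}(\mathtt{node}) \\
\mathtt{transl\_stmt}(\gamma, s, l_2, l_e, l_g, l_r, r_r) : \mathtt{mon}(\mathtt{node})
\end{array}
\]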
These functions recursively add to the CFG the instructions that compute the given
expression or statement and branch to the given end nodes $l_2$, $l_e$, .... Each call returns
the node of the first instruction in this sequence (the node written $l_1$ in the relational
specification). The following Coq excerpts from transl_stmt should give the flavor of
the translation functions:

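    match s with
    | Sskip =>
        (* nothing to generate: the start node is the end node *)
        ret l2
    | Sassign v a =>
        do rd <- new_reg;
        do rv <- find_var γ v;
        (* move the computed value from rd to the register of variable v *)
        do l <- add_instr (Iop Omove (rd::nil) rv l2);
        transl_expr γ a rd l
    | Sseq s1 s2 =>
        (* translate s2 first so that s1 can branch to its start node *)
        do l <- transl_stmt γ s2 l2 lexit lgoto lret rret;
        transl_stmt γ s1 l lexit lgoto lret rret
    | Sloop s =>
        (* reserve the loop head, translate the body, then close the cycle *)
        do l <- reserve_instr;
        do l' <- transl_stmt γ s l lexit lgoto lret rret;
        do x <- update_instr l (Inop l');
        ret l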

Inspired by Haskell, do x <- a; b is a user-defined Coq notation standing for
bind a (λx. b). Two syntactic invariants of the state play a crucial role in proving the
correctness of the generated CFG against the relational specification of section 6.2.
First, in a compile-time state $(g, l, r)$, all CFG nodes above $l$ must be empty: $\forall l' \geq l,\ g(l') = \emptyset$.
Second, the state evolves in a monotone way: nodes are only added to the
CFG, but an already filled node is never modified; likewise, temporary registers are
never reused. (If this were not the case, correct sub-graphs constructed by recursive
invocations to the transl functions could become incorrect after later modifications of
the CFG.) We define this monotonicity property as a partial order $\preceq$ between states:

\[
(g_1, l_1, r_1) \preceq (g_2, l_2, r_2) \;\stackrel{\text{def}}{=}\; l_1 \leq l_2 \;\wedge\; r_1 \leq r_2 \;\wedge\; (\forall l, i,\ g_1(l) = \lfloor i \rfloor \Rightarrow g_2(l) = \lfloor i \rfloor)
\]
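The state-and-error monad and its two invariants can be modeled in a few lines of Python. This is an informal sketch, not the Coq development: states are plain triples `(g, l, r)` with `g` a dictionary, `None` plays the role of Error, instructions are arbitrary tuples, and the invariants are checked at run time by the hypothetical helpers `well_formed` and `leq` instead of being proved.

```python
def ret(x):
    """Monadic return: leave the state unchanged."""
    return lambda s: (s, x)

def bind(m, f):
    """Monadic bind: run m, then f on its result; propagate errors (None)."""
    def run(s):
        out = m(s)
        if out is None:
            return None
        s1, x = out
        return f(x)(s1)
    return run

def new_reg(s):
    """Generate a fresh pseudo-register by incrementing the r component."""
    g, l, r = s
    return ((g, l, r + 1), r)

def add_instr(i):
    """Allocate a fresh CFG node, store instruction i in it, return its label."""
    def run(s):
        g, l, r = s
        return (({**g, l: i}, l + 1, r), l)
    return run

def reserve_instr(s):
    """Allocate a fresh CFG node and leave it empty."""
    g, l, r = s
    return ((g, l + 1, r), l)

def update_instr(n, i):
    """Fill the reserved node n; fail if n is unallocated or already filled."""
    def run(s):
        g, l, r = s
        if n >= l or n in g:
            return None
        return (({**g, n: i}, l, r), None)
    return run

def well_formed(s):
    """First invariant: every node at or above the next label is empty."""
    g, l, _ = s
    return all(n < l for n in g)

def leq(s1, s2):
    """Second invariant: s2 extends s1 without modifying any filled node."""
    (g1, l1, r1), (g2, l2, r2) = s1, s2
    return l1 <= l2 and r1 <= r2 and all(g2.get(n) == i for n, i in g1.items())

def transl_loop(transl_body):
    """Mirror of the Sloop case: reserve the head, translate the body, close the cycle."""
    return bind(reserve_instr, lambda l:
           bind(transl_body(l), lambda l1:
           bind(update_instr(l, ("nop", l1)), lambda _:
           ret(l))))

# Loop whose body is a single (made-up) increment instruction.
s0 = ({}, 1, 1)
body = lambda l_end: add_instr(("op", "incr", ["x"], "x", l_end))
s1, entry = transl_loop(body)(s0)
```

After this run, node 1 holds `("nop", 2)` and node 2 holds the body instruction branching back to node 1; `well_formed(s1)` and `leq(s0, s1)` both hold, and a second `update_instr` on node 1 fails because the node is already filled.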

It is straightforward but tedious to show that these two invariants are satisfied by the 
monadic translation functions, since they hold for the basic operations of the monad 
and are preserved by monadic bind composition. However, we can avoid much proof
effort by taking advantage of Coq's dependent types. The first invariant (of one state)
can be made a part of the state itself, which becomes a dependent record type: 

    Record state: Set := mkstate {
      st_nextreg: reg;
      st_nextnode: node;
      st_code: graph;
      st_wf: forall (l: node), l >= st_nextnode -> st_code!l = None
    }.

Note the 4th field st_wf, which is a proof term that the first invariant holds. The second
invariant (the partial order $\preceq$) is more difficult, as it involves two states. However, it
can be expressed in the definition of the $\mathtt{mon}(\alpha)$ type by turning the function type
$\mathtt{state} \rightarrow \ldots$ into a dependent function type of the shape $\Pi(x : \mathtt{state}).\ \{y \mid P\ x\ y\}$.
Here is the Coq definition of the dependently-typed monad:

    Inductive res (A: Set) (s: state): Set :=
      | Error: res A s
      | OK: A -> forall (s': state), s ≼ s' -> res A s.

    Definition mon (A: Set) : Set := forall (s: state), res A s.

The result of a successful monadic computation of type $\mathtt{mon}(\alpha)$ starting in state $s$ is
$\mathtt{OK}(x, s', \pi)$ where $x : \alpha$ is the return value, $s'$ the final state, and $\pi$ a proof term for
the proposition $s \preceq s'$. The two invariants need to be proved when defining the basic
operations of the monad; for instance, in the case of ret and bind, the corresponding
proofs amount to reflexivity and transitivity of $\preceq$. However, they then automatically
hold for all computations in the dependently-typed monad. 

7 Optimizations based on dataflow analysis 

We now describe two optimization passes performed on the RTL form: constant prop- 
agation and common subexpression elimination. Both passes make use of a generic 
solver for dataflow inequations, which we describe first. 

7.1 Generic solvers for dataflow inequations 

We formalize forward dataflow analyses as follows. We are given a control-flow graph (as
a function $\mathtt{successors} : \mathtt{node} \rightarrow \mathtt{list}(\mathtt{node})$) and a transfer function $T : \mathtt{node} \times \mathcal{A} \rightarrow \mathcal{A}$,
where $\mathcal{A}$ is the type of abstract values (the results of the analysis), equipped with a
partial order $\geq$. Intuitively, $T(l)$ computes the abstract value "after" the instruction
at point $l$ as a function of the abstract value "before" this instruction. We are also
given a set $cstrs$ of pairs of a CFG node and an abstract value $a : \mathcal{A}$, representing
e.g. requirements on the CFG entry point. The result of forward dataflow analysis is a
solution $A : \mathtt{node} \rightarrow \mathcal{A}$ to the following forward dataflow inequations:

\[
\begin{array}{ll}
A(s) \geq T(l, A(l)) & \text{for all } s \in \mathtt{successors}(l) \\
A(l) \geq a & \text{for all } (l, a) \in cstrs
\end{array}
\]

We formalize dataflow analysis as inequations instead of the usual equations
$A(l) = \bigsqcup \{ T(p, A(p)) \mid l \in \mathtt{successors}(p) \}$ because we are interested only in the correctness
of the solutions, not in their optimality.
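Because only the inequations matter, a candidate solution can be checked independently of how it was computed. The following Python sketch shows such a checker on a toy instance; the function name `is_solution`, the example graph and the set-based abstract domain (ordered by inclusion) are assumptions made for this illustration.

```python
def is_solution(A, nodes, successors, transfer, cstrs, geq):
    """Check A(s) >= T(l, A(l)) for every edge l -> s, and A(l) >= a for every constraint."""
    for l in nodes:
        after = transfer(l, A[l])
        if any(not geq(A[s], after) for s in successors(l)):
            return False
    return all(geq(A[l], a) for l, a in cstrs)

# Toy instance: abstract values are sets of variable names, ordered by inclusion.
SUCC = {1: [2], 2: [3], 3: [2, 4], 4: []}
DEFS = {1: {"x"}, 2: {"y"}, 3: {"z"}, 4: set()}

def successors(l):
    return SUCC[l]

def transfer(l, a):
    return a | DEFS[l]            # add the variables defined at node l

def geq(a, b):
    return a >= b                 # superset test on sets

CSTRS = [(1, frozenset({"p"}))]   # requirement on the entry point
ALL = frozenset({"p", "x", "y", "z"})

least = {1: frozenset({"p"}), 2: ALL, 3: ALL, 4: ALL}
coarse = {l: ALL for l in SUCC}   # correct but not optimal
wrong = {**least, 4: frozenset()}
```

Both `least` and `coarse` satisfy the inequations, which illustrates why optimality is not required for correctness, while `wrong` violates the inequation for the edge from node 3 to node 4.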

Two solvers for dataflow inequations are provided as Coq functors, that is, modules 
parameterized by a module defining the type $\mathcal{A}$ and its operations. The first solver
implements Kildall's worklist algorithm [48]. It is applicable if the type $\mathcal{A}$ is equipped
with a decidable equality, a least element $\bot$ and an upper bound operation $\sqcup$. (Again,
since we are not interested in optimality of the results, $\sqcup$ is not required to compute
the least upper bound.) The second solver performs propagation over extended basic
blocks, setting $A(l) = \top$ in the solution for all points $l$ that have several predecessors.
The only requirement over the type $\mathcal{A}$ is that it possesses a greatest element $\top$. This
propagation-based solver is useful in cases where upper bounds do not always exist or 
are too expensive to compute. 
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Several mechanical verifications of Kildall's worklist algorithm have been published 
already [5,26,49,12]. Therefore, we do not detail the correctness proofs for our solvers, 
referring the reader to the cited papers and to the Coq development for more details. 

The solvers actually return an option type, with $\emptyset$ denoting failure and $\lfloor A \rfloor$
denoting success with solution $A$. For simplicity, we do not require that the CFG is
finite, nor in the case of Kildall's algorithm that the $\ge$ ordering over $A$ is well founded
(no infinite ascending chains). Consequently, we cannot guarantee termination of the
solvers and must allow them to fail, at least formally. The implementations of the two
solvers bound the number of iterations performed to find a fixed point, returning $\emptyset$ if
a solution cannot be found in $N$ iterations, where $N$ is a very large constant.
Alternatively, unbounded iteration can be implemented using the approach of Bertot and
Komendantsky [13], which uses classical logic and Tarski's theorem to model general
recursion in Coq. Yet another alternative is to use the "verified validator" approach,
where the computation of the solution is delegated to external, untrusted code, then
verified a posteriori to satisfy the dataflow inequations. In all these approaches, if the
static analysis fails, we can either abort the compilation process or simply turn off the
corresponding optimization pass, returning the input code unchanged.
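
The bounded-iteration scheme can be illustrated by a small Python sketch of a worklist solver. It is not the Coq functor: the `fuel` parameter stands for the constant $N$, `None` stands for failure, and the set-based toy domain is a hypothetical example.

```python
def kildall(nodes, successors, transfer, bottom, lub, cstrs, fuel=10**6):
    """Worklist solver for forward inequations; returns None when fuel runs out."""
    A = {l: bottom for l in nodes}
    for l, a in cstrs:
        A[l] = lub(A[l], a)
    worklist = list(nodes)
    while worklist:
        if fuel == 0:
            return None                 # give up: the optimization is simply skipped
        fuel -= 1
        l = worklist.pop()
        after = transfer(l, A[l])
        for s in successors(l):
            new = lub(A[s], after)      # lub only needs to be an upper bound
            if new != A[s]:             # decidable equality on abstract values
                A[s] = new
                worklist.append(s)
    return A


# Toy domain (hypothetical): sets of nodes ordered by inclusion, T(l, a) = a U {l}.
succ = {1: [2], 2: [3], 3: [2]}
solve = lambda fuel: kildall(succ.keys(), succ.__getitem__,
                             lambda l, a: a | {l}, frozenset(),
                             frozenset.union, [], fuel)
```

With enough fuel the solver returns a solution of the inequations; with `solve(2)` it returns `None`, and a caller would then fall back to leaving the code unchanged.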

Solvers for backward dataflow inequations can be easily derived from the forward
solvers by reversing the edges of the control-flow graph. In the backward case, the
transfer function $T(l)$ computes the abstract value "before" the instruction at point $l$
as a function of the abstract value "after" this instruction. The solution $A$ returned by
the solvers satisfies the backward dataflow inequations:

$$A(l) \ge T(s, A(s)) \quad \text{for all } s \in \mathtt{successors}(l)$$
$$A(l) \ge a \quad \text{for all } (l, a) \in \mathit{cstrs}$$



7.2 Constant propagation 
7.2.1 Static analysis 

Constant propagation for a given function starts with a forward dataflow analysis using
the following domain of abstract values:

$$A = r \mapsto (\top \mid \bot \mid \mathtt{Int}(n) \mid \mathtt{Float}(f) \mid \mathtt{Addrsymbol}(id + \delta))$$

That is, at each program point and for each register $r$, we record whether its value
at this point is statically known to be equal to an integer $n$, a float $f$, or the address
of a symbol $id$ plus an offset $\delta$, or is unknown ($\top$), or whether this program point is
unreachable ($\bot$). Kildall's algorithm is used to solve the dataflow inequations, with
the additional constraint that $A(F.\mathtt{entrypoint}) \ge (r \mapsto \top)$. If it fails to find a proper
solution $A$, we take the trivial solution $A(l) = (r \mapsto \top)$, effectively turning off the
optimization. For each function $F$ of the program, we write $\mathtt{analyze}(F)$ for the solution
(proper or trivial) of the dataflow inequations for $F$.

The transfer function $T_F$ is the obvious abstract interpretation of RTL's semantics
on this domain:

$$T_F(l, a) = \begin{cases}
a\{r \leftarrow \widehat{\mathtt{eval\_op}}(op, a(\vec{r}))\} & \text{if } F.\mathtt{code}(l) = \lfloor \mathtt{op}(op, \vec{r}, r, l') \rfloor \\
a\{r \leftarrow \top\} & \text{if } F.\mathtt{code}(l) = \lfloor \mathtt{load}(\kappa, mode, \vec{r}, r, l') \rfloor \\
a\{r \leftarrow \top\} & \text{if } F.\mathtt{code}(l) = \lfloor \mathtt{call}(sig, \_, \_, r, l') \rfloor \\
a & \text{otherwise}
\end{cases}$$

Here, $\widehat{\mathtt{eval\_op}}$ is the abstract interpretation over the domain $A$ of the eval_op function
defining the semantics of operators. For lack of an alias analysis, we do not attempt to
track constant values stored in memory; therefore, the abstract value of the result of a
load is $\top$.
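
As an illustration, the transfer function can be sketched in Python over a simplified abstract domain. The tuple encoding of instructions, the two sample operators and the 32-bit wraparound are assumptions made for this example; only integer constants are modeled.

```python
TOP, BOT = "top", "bot"   # unknown value / unreachable point


def eval_op_abs(op, vals):
    """Abstract interpretation of a few hypothetical operators over Int(n) values."""
    if any(v == BOT for v in vals):
        return BOT
    if isinstance(op, tuple) and op[0] == "const":          # "load constant n"
        return ("int", op[1])
    if op in ("add", "mul") and all(isinstance(v, tuple) and v[0] == "int" for v in vals):
        x, y = vals[0][1], vals[1][1]
        return ("int", (x + y if op == "add" else x * y) & 0xFFFFFFFF)
    return TOP


def transfer(code, l, a):
    """T_F(l, a): abstract register map after the instruction at l (missing register = TOP)."""
    instr = code[l]
    if instr[0] == "op":                    # ("op", op, args, dst, succ)
        _, op, args, dst, _succ = instr
        return {**a, dst: eval_op_abs(op, [a.get(r, TOP) for r in args])}
    if instr[0] in ("load", "call"):        # destination is the next-to-last field
        return {**a, instr[-2]: TOP}
    return a


code = {
    1: ("op", ("const", 2), [], "x", 2),
    2: ("op", ("const", 3), [], "y", 3),
    3: ("op", "add", ["x", "y"], "z", 4),
    4: ("load", "int32", "indexed", ["z"], "x", 5),
    5: ("return",),
}
a = {}
for l in (1, 2, 3):
    a = transfer(code, l, a)
after_load = transfer(code, 4, a)
```

After point 3 the register `z` is known to hold the integer 5; after the load at point 4, `x` is mapped back to `TOP`, as the text prescribes for loads.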

7.2.2 Code transformation 

The code transformation exploiting the results of this analysis is straightforward: op 
instructions become "load constant" instructions if the values of all argument registers 
are statically known; cond instructions where the condition can be statically deter- 
mined to be always true or always false are turned into nop instructions branching 
to the appropriate successor; finally, operators, conditions and addressing modes are 
specialized to cheaper immediate forms if the values of some of their arguments are 
statically known. The structure of the control-flow graph is preserved (no node is in- 
serted nor deleted), making this transformation easy to express as a morphism over 
the CFG. Parts of the CFG can become unreachable as a consequence of statically
resolving some cond instructions. The corresponding instructions, as well as the nop 
instructions that replace the statically-resolved cond instructions, will be removed later 
during branch tunneling and CFG linearization (sections 9 and 10). 

7.2.3 Semantic preservation 

The proof of semantic preservation for constant propagation is based on a "lock-step"
simulation diagram. The central invariant of the diagram is the following: at every
program point $l$ within a function $F$, the concrete values $R(r)$ of registers $r$ must
agree with the abstract values $\mathtt{analyze}(F)(l)(r)$ predicted by the dataflow analysis.
Agreement between a concrete value $v$ and an abstract value $a$ is written $\models v : a$ and
defined as follows.

$$\models v : \top \qquad \models \mathtt{int}(n) : \mathtt{Int}(n) \qquad \models \mathtt{float}(f) : \mathtt{Float}(f) \qquad \frac{\mathtt{symbol}(G, id) = \lfloor b \rfloor}{\models \mathtt{ptr}(b, \delta) : \mathtt{Addrsymbol}(id + \delta)}$$

We write $\models R : A$ to mean $\models R(r) : A(r)$ for all registers $r$. The first part of the proof
shows the correctness of the abstract interpretation with respect to this agreement
relation. For example, if $\mathtt{eval\_op}(op, \vec{v}) = \lfloor v \rfloor$ and $\models \vec{v} : \vec{a}$, we show that
$\models v : \widehat{\mathtt{eval\_op}}(op, \vec{a})$. Likewise, we show that the specialized forms of operations, addressing
modes and conditions produced by the code transformation compute the same values
as the original forms, provided the concrete arguments agree with the abstract values
used to determine the specialized forms. These proofs are large but straightforward
case analyses. We then define the relation between pairs of RTL states that is invariant
under execution steps.

$$\frac{[\![F]\!] = \lfloor F' \rfloor \quad \Sigma \sim \Sigma' \quad \models R : \mathtt{analyze}(F)(l)}{\mathcal{S}(\Sigma, F.\mathtt{code}, \sigma, l, R, M) \sim \mathcal{S}(\Sigma', F'.\mathtt{code}, \sigma, l, R, M)}$$

$$\frac{[\![Fd]\!] = \lfloor Fd' \rfloor \quad \Sigma \sim \Sigma'}{\mathcal{C}(\Sigma, Fd, \vec{v}, M) \sim \mathcal{C}(\Sigma', Fd', \vec{v}, M)} \qquad \frac{\Sigma \sim \Sigma'}{\mathcal{R}(\Sigma, v, M) \sim \mathcal{R}(\Sigma', v, M)}$$

$$\frac{[\![F]\!] = \lfloor F' \rfloor \quad \Sigma \sim \Sigma' \quad \forall v,\ \models R\{r \leftarrow v\} : \mathtt{analyze}(F)(l)}{\mathcal{F}(r, F.\mathtt{code}, \sigma, l, R).\Sigma \sim \mathcal{F}(r, F'.\mathtt{code}, \sigma, l, R).\Sigma'} \qquad \epsilon \sim \epsilon$$

In the rule for stack frames, v stands for the value that will eventually be returned 
to the pending function call. Since it is unpredictable, we quantify universally over 
this return value. Semantic preservation for constant propagation follows by theorem 3 
from the simulation lemma below. 

Lemma 7 If $G \vdash S_1 \xrightarrow{t} S_2$ and $S_1 \sim S_1'$, there exists $S_2'$ such that
$G' \vdash S_1' \xrightarrow{t} S_2'$ and $S_2 \sim S_2'$.

7.3 Common subexpression elimination 
7.3.1 Static analysis 

Common subexpression elimination is implemented via local value numbering performed
over extended basic blocks. Value numbers $x$ are identifiers representing run-time
values abstractly. A forward dataflow analysis associates to each program point $l$ of $F$
a pair $(\phi, \eta)$ of a partial mapping $\phi$ from registers to value numbers and a set $\eta$
of equations between value numbers of the form $x = op(\vec{x})$ or $x = \kappa, mode(\vec{x})$. In a
sense that will be made semantically precise below, the first equation form means that
the operation $op$ applied to concrete values matching the value numbers $\vec{x}$ returns a
concrete value that matches $x$; likewise, the second equation means that computing an
address using the addressing mode $mode$ applied to values matching $\vec{x}$ and loading a
quantity $\kappa$ from this address in the current memory state returns a value matching $x$.
In addition to $\phi$ and $\eta$, the analysis result at each point also contains a supply of fresh
value numbers and a proof that value numbers appearing in $\phi$ and $\eta$ are not fresh with
respect to this supply. We omit these two additional components for simplicity.

The transfer function $T_F(l, A)$, where $A = (\phi, \eta)$ is the analysis result "before" point $l$, is
defined as follows. If the instruction at $l$ is a move $\mathtt{op}(\mathtt{move}, r_s, r_d, l')$, we record the
equality $r_d = r_s$ by returning $(\phi\{r_d \leftarrow \phi(r_s)\}, \eta)$. If the instruction at $l$ is $\mathtt{op}(op, \vec{r}, r, l')$,
we determine the value numbers $\vec{x} = \phi(\vec{r})$, associating fresh value numbers to elements
of $\vec{r}$ if needed. We then check whether $\eta$ contains an equation of the form $x = op(\vec{x})$ for
some $x$. If so, the computation performed by this op operation has already been
performed earlier in the program and the transfer function returns $(\phi\{r \leftarrow x\}, \eta)$. If not,
we allocate a fresh value number $x$ to stand for the result of this operation and return
$(\phi\{r \leftarrow x\}, \eta \cup \{x = op(\vec{x})\})$ as the result of the transfer function. The definition of the
transfer function is similar in the case of load instructions, using memory equalities
$x = \kappa, mode(\vec{x})$ instead of arithmetic equalities $x = op(\vec{x})$. For store instructions, in
the absence of nonaliasing information we must assume that the store can invalidate
any of the memory equations that currently hold. The transfer function therefore
removes all such equations from $\eta$. For call instructions, we remove all equations, since
eliminating common subexpressions across function calls is often detrimental to the
quality of register allocation. Other instructions keep $(\phi, \eta)$ unchanged.

We say that a value numbering $(\phi, \eta)$ is satisfiable in a register state $R$ and a memory
state $M$, and we write $R, M \models \phi, \eta$, if there exists a valuation $V$ associating concrete
values to value numbers such that



1. If $\phi(r) = \lfloor x \rfloor$, then $R(r) = V(x)$.

2. If $\eta$ contains the equation $x = op(\vec{x})$, then $\mathtt{eval\_op}(op, V(\vec{x})) = \lfloor V(x) \rfloor$.

3. If $\eta$ contains the equation $x = \kappa, mode(\vec{x})$, then there exist $b$ and $\delta$ such that
$\mathtt{eval\_addressing}(mode, V(\vec{x})) = \lfloor \mathtt{ptr}(b, \delta) \rfloor$ and $\mathtt{load}(\kappa, M, b, \delta) = \lfloor V(x) \rfloor$.

Using this notion of satisfaction, we order value numberings by entailment:
$(\phi', \eta') \ge (\phi, \eta)$ if $R, M \models \phi, \eta \implies R, M \models \phi', \eta'$ for all $R, M$. Least upper bounds for this
ordering are known to be difficult to compute efficiently [48,38]. We sidestep this issue
by solving the dataflow inequations using the approximate solver based on propagation
described in section 7.1 instead of Kildall's algorithm.

7.3.2 Code transformation 

The actual elimination of common subexpressions is simple. Consider an instruction $i$ at
point $l$, and let $(\phi, \eta)$ be the results of the static analysis at point $l$. If $i$ is $\mathtt{op}(op, \vec{r}, r, l')$
or $\mathtt{load}(\kappa, mode, \vec{r}, r, l')$, and there exists a register $r'$ such that the equation
$\phi(r') = op(\phi(\vec{r}))$ or respectively $\phi(r') = \kappa, mode(\phi(\vec{r}))$ is in $\eta$, we rewrite the instruction as a
move $\mathtt{op}(\mathtt{move}, r', r, l')$ from $r'$ to $r$. This eliminates the redundant computation, reusing
instead the result of the previous equivalent computation, which is still available in
register $r'$. In all other cases, the instruction is unchanged.
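
The analysis of section 7.3.1 and the rewriting above can be sketched together for a single basic block. The Python code below is a simplified illustration, not the Coq implementation: the tuple encoding of instructions is an assumption, the freshness proofs are omitted, and the result register of a call is simply forgotten in $\phi$.

```python
import itertools


def cse_block(instrs):
    """Local value numbering over one basic block, rewriting redundant ops/loads as moves."""
    fresh = itertools.count()
    phi = {}   # register -> value number
    eta = {}   # right-hand side -> value number (one entry per equation x = rhs)
    out = []

    def vn(r):
        if r not in phi:
            phi[r] = next(fresh)      # fresh value number for a not-yet-numbered register
        return phi[r]

    for ins in instrs:
        kind = ins[0]
        if kind == "op" and ins[1] == "move":        # ("op", "move", [src], dst)
            phi[ins[3]] = vn(ins[2][0])
            out.append(ins)
        elif kind in ("op", "load"):                 # (..., args, dst)
            *head, args, dst = ins
            rhs = (tuple(head), tuple(vn(r) for r in args))
            x = eta.get(rhs)
            holder = None
            if x is not None:                        # some register may still hold the value
                holder = next((r for r, v in phi.items() if v == x), None)
            else:
                x = next(fresh)
                eta[rhs] = x
            out.append(ins if holder is None else ("op", "move", [holder], dst))
            phi[dst] = x
        elif kind == "store":                        # may invalidate every memory equation
            eta = {rhs: x for rhs, x in eta.items() if rhs[0][0] != "load"}
            out.append(ins)
        elif kind == "call":                         # (..., dst): drop all equations
            eta = {}
            phi.pop(ins[-1], None)
            out.append(ins)
        else:
            out.append(ins)
    return out


block = [
    ("op", "add", ["a", "b"], "x"),
    ("op", "add", ["a", "b"], "y"),
    ("load", "int32", "indexed", ["x"], "u"),
    ("store", "int32", "indexed", ["y"], "a"),
    ("load", "int32", "indexed", ["x"], "v"),
    ("op", "add", ["a", "b"], "z"),
]
optimized = cse_block(block)
```

In this example the second and third additions become moves from `x`, while the second load is kept because the intervening store removed the memory equation.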

7.3.3 Semantic preservation 

The correctness proof for common subexpression elimination closely follows the pattern
of the proof for constant propagation (see section 7.2.3), using a lock-step simulation
diagram. The only difference is the replacement of the hypothesis $\models R : A(l)$ in the
invariant between states by the hypothesis $R, M \models A(l)$, meaning that at every
program point the current register and memory states must satisfy the value numbering
obtained by static analysis.



8 Register allocation 

The next compilation pass performs register allocation by coloring of an interference 
graph. 



8.1 The target language: LTL 

The target language for register allocation is a variant of RTL called LTL (Location
Transfer Language). Syntactically, the only difference between LTL and RTL is that the
RTL pseudo-registers $r$ appearing as arguments and results of instructions are replaced
by locations $\ell$. A location is either a hardware register or an abstract designation of a
stack slot.

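$$\begin{array}{lrll}
\text{Locations:} & \ell ::= & r_m \mid s & \\
\text{Machine registers:} & r_m ::= & \mathtt{R3} \mid \mathtt{R4} \mid \ldots & \text{PowerPC integer registers} \\
 & \mid & \mathtt{F1} \mid \mathtt{F2} \mid \ldots & \text{PowerPC float registers} \\
\text{Stack slots:} & s ::= & \mathtt{local}(\tau, \delta) & \text{local variables} \\
 & \mid & \mathtt{incoming}(\tau, \delta) & \text{incoming parameters} \\
 & \mid & \mathtt{outgoing}(\tau, \delta) & \text{outgoing arguments}
\end{array}$$

$$\frac{F.\mathtt{code}(l) = \lfloor \mathtt{call}(sig, \ell_f, \vec{\ell}, \ell, l') \rfloor \quad L(\ell_f) = \mathtt{ptr}(b, 0) \quad \mathtt{funct}(G, b) = \lfloor Fd \rfloor \quad Fd.\mathtt{sig} = sig}{G \vdash \mathcal{S}(\Sigma, F, \sigma, l, L, M) \xrightarrow{\epsilon} \mathcal{C}(\mathcal{F}(\ell, F, \sigma, l', \mathtt{postcall}(L)).\Sigma, Fd, L(\vec{\ell}), M)}$$

Fig. 12 Semantics of LTL. The transitions not shown are similar to those of RTL.

In stack slots, $\tau$ is the intended type of the slot (int or float) and $\delta$ an integer
representing a word offset in one of the three areas of the activation record (local,
incoming and outgoing).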

In LTL, stack slots are not yet mapped to actual memory locations. Their values,
along with those of the machine registers, are recorded in a mapping $L : \mathit{loc} \to \mathit{val}$
similar to the mapping $R : \mathit{reg} \to \mathit{val}$ used in the RTL semantics and disjoint from the
memory state $M$. However, the LTL semantics treats locations in a way that anticipates
their behaviors once they are later mapped to actual memory locations and processor
registers. In particular, we account for the possible overlap between distinct stack slots
once they are mapped to actual memory areas. For instance, outgoing(float, 0)
overlaps with outgoing(int, 0) and outgoing(int, 1): assigning to one of these locations
invalidates the values of the other two. This is reflected in the weak "good variable"
property for location maps $L$:

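$$(L\{\ell_1 \leftarrow v\})(\ell_2) = \begin{cases}
v & \text{if } \ell_1 = \ell_2; \\
L(\ell_2) & \text{if } \ell_1 \text{ and } \ell_2 \text{ do not overlap;} \\
\mathtt{undef} & \text{if } \ell_1 \text{ and } \ell_2 \text{ partially overlap.}
\end{cases}$$

Contrast with the standard "good variable" property for register maps $R$:

$$(R\{r_1 \leftarrow v\})(r_2) = \begin{cases}
v & \text{if } r_1 = r_2; \\
R(r_2) & \text{if } r_1 \ne r_2.
\end{cases}$$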
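
The overlap-aware update can be sketched in Python as follows. The encoding of locations as tuples is hypothetical, and the sketch assumes that a float slot spans two words and an int slot one, which is what the outgoing(float, 0) example above suggests.

```python
UNDEF = "undef"


def words(ty):
    """Number of words occupied by a slot of the given type (assumption: float = 2, int = 1)."""
    return 2 if ty == "float" else 1


def overlap(l1, l2):
    """Two slots overlap if they lie in the same area and their word ranges intersect."""
    if l1[0] != "slot" or l2[0] != "slot":
        return l1 == l2                      # machine registers only overlap with themselves
    _, area1, ty1, ofs1 = l1
    _, area2, ty2, ofs2 = l2
    return area1 == area2 and ofs1 < ofs2 + words(ty2) and ofs2 < ofs1 + words(ty1)


def update(L, l1, v):
    """L{l1 <- v}: every location overlapping l1 becomes undef, then l1 receives v."""
    new = {l2: (UNDEF if overlap(l1, l2) else w) for l2, w in L.items()}
    new[l1] = v
    return new


out_f0 = ("slot", "outgoing", "float", 0)
out_i0 = ("slot", "outgoing", "int", 0)
out_i1 = ("slot", "outgoing", "int", 1)
loc_i0 = ("slot", "local", "int", 0)
L2 = update({out_i0: 1, out_i1: 2, loc_i0: 3}, out_f0, 3.5)
```

Writing the float slot invalidates both overlapping int slots in the outgoing area, while the local slot at the same offset keeps its value.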

The dynamic semantics of LTL is illustrated in figure 12. Apart from the use of 
location maps L and overlap-aware update instead of register maps R and normal 
updates, the only significant difference with the semantics of RTL is the semantics of 
call instructions. In preparation for enforcing calling conventions as described later 
in section 11, processor registers that are temporary or caller-save according to the 
calling conventions are set to undef in the location state of the caller, using the function 
postcall defined as 



$$\mathtt{postcall}(L)(\ell) = \begin{cases}
\mathtt{undef} & \text{if } \ell \text{ is a temporary or caller-save register;} \\
L(\ell) & \text{otherwise.}
\end{cases}$$



This forces the LTL producer, namely the register allocation pass, to ensure that no 
value live across a function call is stored in a caller-save register. 
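
A minimal Python sketch of postcall, with the set of temporary and caller-save registers passed as a parameter. The register names in the example are purely illustrative and are not meant to reproduce the actual calling conventions.

```python
UNDEF = "undef"


def postcall(L, destroyed_at_call):
    """Location map of the caller after a call: destroyed registers become undef."""
    return {loc: (UNDEF if loc in destroyed_at_call else v) for loc, v in L.items()}


# Hypothetical example: "T1" is a temporary, "CS1" a caller-save register.
before = {"T1": 10, "CS1": 20, "KEEP1": 30, ("local", "int", 0): 40}
after = postcall(before, {"T1", "CS1"})
```

A value that must survive the call therefore has to live in a location such as `KEEP1` or a stack slot, which is the constraint imposed on the register allocator.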



8.2 Code transformation 



For every function, register allocation is performed in four steps, which we now outline.

8.2.1 Type reconstruction for RTL

The first step performs type reconstruction for the RTL source function in a trivial
"int-or-float" type system similar to that of Cminor (see section 4.3). To each RTL
pseudo-register we assign a type int or float, based on its uses within the function.
The resulting type assignment $\Gamma : r \mapsto \tau$ will guide the register allocator, making sure
a machine register or stack location of the correct kind is assigned to each
pseudo-register.

All operators, addressing modes and conditions are monomorphic (each of their
arguments has only one possible type, either int or float), except the move operation
which is polymorphic with type $\forall \tau.\ \tau \to \tau$. Type reconstruction can therefore be
performed by trivial unification. We use the verified validator approach: a candidate
type assignment $\Gamma$ is computed by untrusted Caml code, using in-place unification,
then verified for correctness by a simple type-checker, written and proved correct in
Coq.

8.2.2 Liveness analysis 

We compute the set $A(l)$ of pseudo-registers live "after" every program point $l$. These
sets are a solution of the following backward dataflow inequations:

$$A(l) \supseteq T(s, A(s)) \quad \text{for all } s \text{ successor of } l$$

The transfer function $T$ computes the set of live pseudo-registers "before" an
instruction, as a function of this instruction and the set of live pseudo-registers "after" it.
Classically, it removes the registers defined by the instruction, then adds the registers
used. A special case is made for op or load instructions whose result register is not live
"after" the instruction, since these instructions will later be eliminated as dead code.
For instance, if the instruction at $l$ in $F$ is $\mathtt{op}(op, \vec{r}, r, l')$, then

$$T(l, A) = \begin{cases}
(A \setminus \{r\}) \cup \vec{r} & \text{if } r \in A; \\
A & \text{if } r \notin A.
\end{cases}$$

Liveness information is computed by Kildall's algorithm, using the generic dataflow
solver described in section 7.1.

8.2.3 Construction of the interference graph

Based on the results of liveness analysis, an interference graph is built following
Chaitin's construction [20]. Two kinds of interferences are recorded: between two
pseudo-registers $(r, r')$ and between a pseudo-register $r$ and a machine register $r_m$.
To enable coalescing during allocation, register affinities arising from moves (either
explicit or implicit through calling conventions) are also recorded. (Affinities do not
affect the correctness of the generated code but have a considerable impact on its
performance.)

The interference graph is represented by sets of edges: either unordered pairs $(r, r')$
or ordered pairs $(r, r_m)$. The graph is constructed incrementally by enumerating every
RTL instruction and adding interference edges between the defined register $r$ (if any)
and the registers $r' \in A(l) \setminus \{r\}$ live "across" the instruction. For move instructions
$l : \mathtt{op}(\mathtt{move}, r_s, r_d, l')$, we avoid adding an edge between $r_s$ and $r_d$, as proposed by
Chaitin [20]. Finally, for call instructions, additional interference edges are introduced
pairwise between the registers live across the call and the caller-save machine registers.
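
The liveness analysis of section 8.2.2 and the graph construction above can be sketched together in Python. This is an illustration under simplifying assumptions: instructions are dictionaries with hypothetical fields `kind`, `def`, `uses` and `succs`, the fixed point is computed by naive iteration rather than by the generic Kildall solver, and affinities are not recorded.

```python
def live_before(ins, after):
    """Transfer function T: registers live before ins, given those live after it."""
    d = ins["def"]
    if ins["kind"] in ("op", "move", "load") and d not in after:
        return after                              # dead instruction: it will be removed
    return (after - {d}) | set(ins["uses"])


def live_after(code):
    """Solve A(l) >= T(s, A(s)) for all successors s of l by iterating to a fixed point."""
    A = {l: frozenset() for l in code}
    changed = True
    while changed:
        changed = False
        for l, ins in code.items():
            new = A[l].union(*(live_before(code[s], A[s]) for s in ins["succs"]))
            if new != A[l]:
                A[l], changed = new, True
    return A


def interference(code, A, caller_save=()):
    """Edges between pseudo-registers, and between pseudo-registers and machine registers."""
    edges, mach_edges = set(), set()
    for l, ins in code.items():
        d = ins["def"]
        if d is not None:
            skip = {d, ins["uses"][0]} if ins["kind"] == "move" else {d}
            edges |= {frozenset((d, r)) for r in A[l] - skip}
        if ins["kind"] == "call":
            mach_edges |= {(r, rm) for r in A[l] - {d} for rm in caller_save}
    return edges, mach_edges


code = {
    1: {"kind": "op", "def": "x", "uses": [], "succs": [2]},
    2: {"kind": "move", "def": "y", "uses": ["x"], "succs": [3]},
    3: {"kind": "op", "def": "z", "uses": ["x", "y"], "succs": [4]},
    4: {"kind": "op", "def": "w", "uses": ["z", "z"], "succs": [5]},
    5: {"kind": "return", "def": None, "uses": ["z"], "succs": []},
}
A = live_after(code)
edges, mach_edges = interference(code, A)
```

In this example `x` and `y` are simultaneously live after point 2, yet no edge is added between them because the instruction is a move; they remain candidates for coalescing.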
8.2.4 Coloring of the interference graph

To color the interference graph, we use the iterated coalescing algorithm of George
and Appel [35]. The result is a mapping $\Phi : r \mapsto \ell$ from pseudo-registers to locations.
Following once more the verified validator approach, the actual coloring is performed
by an untrusted implementation of the George-Appel algorithm written in Caml and
using imperative doubly-linked lists for efficiency. The candidate coloring returned is
then verified by a simple validator written and proved correct in Coq. Like many
NP-complete problems, graph coloring is a paradigmatic example of a computation that is
significantly easier to validate a posteriori than to prove correct. Validation proceeds
by enumerating the nodes and edges of the interference graph, checking the following
properties:

1. Correct colors: $\Phi(r) \ne \Phi(r')$ for all edges $(r, r')$ of the interference graph; likewise,
$\Phi(r) \ne r_m$ for all interference edges $(r, r_m)$.

2. Register class preservation: the type of the location $\Phi(r)$ is equal to $\Gamma(r)$ for all
pseudo-registers $r$.

3. Validity of locations: for all $r$, the location $\Phi(r)$ is either a local stack slot or a
non-temporary machine register (see the footnote on reserved temporaries).
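
The three checks can be sketched as a small Python validator. The encoding of locations as tuples and the parameter names are assumptions made for this illustration; the actual validator is written and proved correct in Coq.

```python
def validate_coloring(phi, edges, mach_edges, reg_type, loc_type, temporaries):
    """Check a candidate coloring phi against the three validation conditions."""
    def valid(loc):
        return loc[0] == "local" or (loc[0] == "reg" and loc not in temporaries)

    correct_colors = (all(phi[r1] != phi[r2] for r1, r2 in edges)
                      and all(phi[r] != rm for r, rm in mach_edges))
    class_preserved = all(loc_type(phi[r]) == ty for r, ty in reg_type.items())
    locations_valid = all(valid(phi[r]) for r in reg_type)
    return correct_colors and class_preserved and locations_valid


# Hypothetical locations: ("reg", name, type) and ("local", type, offset).
r3, r4, tmp = ("reg", "R3", "int"), ("reg", "R4", "int"), ("reg", "TMP", "int")
slot0 = ("local", "int", 0)
loc_type = lambda loc: loc[2] if loc[0] == "reg" else loc[1]
reg_type = {"x": "int", "y": "int", "z": "int"}
edges = [("x", "y"), ("y", "z")]
mach_edges = [("y", r3)]

good = {"x": r3, "y": r4, "z": slot0}
clash = {"x": r4, "y": r4, "z": slot0}      # x and y interfere but share R4
uses_tmp = {"x": r3, "y": tmp, "z": slot0}  # y is assigned a reserved temporary
```

Only `good` passes; a rejected coloring would make register allocation fail rather than produce incorrect code.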

8.2.5 LTL generation 

Finally, the actual code transformation from RTL to LTL is a trivial per-instruction
rewriting of the CFG where each mention of a pseudo-register $r$ is replaced by the
location $\Phi(r)$ allocated to $r$. For instance, the RTL instruction $l : \mathtt{op}(op, \vec{r}, r, l')$ becomes
the LTL instruction $l : \mathtt{op}(op, \Phi(\vec{r}), \Phi(r), l')$. There are two exceptions to this rule. First,
a move instruction $l : \mathtt{op}(\mathtt{move}, r_s, r_d, l')$ such that $\Phi(r_s) = \Phi(r_d)$ is turned into a
no-operation $l : \mathtt{nop}(l')$, therefore performing one step of coalescing. Second, an op or load
instruction whose result register is not live after the instruction is similarly turned into
a nop instruction, therefore performing dead-code elimination.

8.3 Semantic preservation 

The proof that register allocation preserves program behaviors is, once more, based on 
a lock-step simulation diagram. The invariant between states is, however, more complex 
than those used for constant propagation or common subexpression elimination. For 
these two optimizations, the register state R was identical between matching states, 
because the same values would be computed (by possibly different instructions) in the 
original and transformed program and would be stored in the same registers. This 
assumption no longer holds in the case of register allocation. 

A value computed by the original RTL program and stored in pseudo-register $r$ is
stored in location $\Phi(r)$ in the transformed LTL program. Naively, we could relate the
RTL register state $R$ and the LTL location state $L$ by $R(r) = L(\Phi(r))$ for all
pseudo-registers $r$. However, this requirement is too strong, as it essentially precludes sharing
a location between several pseudo-registers.

Footnote (section 8.2.4, validity of locations): We reserve 2 integer and 3 float machine
registers as temporaries to be used later for spilling and reloading. These temporary
registers must therefore not be used by the register allocator.

To progress towards the correct invariant, consider the semantic interpretation of
live and dead pseudo-registers. If a pseudo-register $r$ is dead at point $l$ in the original
RTL code, then its value $R(r)$ has no impact on the remainder of the program execution:
either $r$ will never be used again, or it will be redefined before being used; in either
case, its value $R(r)$ could be replaced by any other value without any harm. A better
relation between the values of pseudo-registers $R$ and locations $L$ at point $l$ is therefore

$$R(r) = L(\Phi(r)) \quad \text{for all pseudo-registers } r \in T(l, A(l))$$

In other words, at each program point $l$, register allocation must preserve the values
of all registers live "before" executing the instruction at this point. This property, which
we have never seen spelled out explicitly in the compiler literature, captures concisely and
precisely the essence of register allocation.

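The invariant between RTL and LTL states is, then:

$$\frac{\begin{array}{c}
\Sigma \sim \Sigma' \quad [\![F]\!] = \lfloor F' \rfloor \quad R(r) = L(\Phi(r)) \text{ for all } r \in T(l, A(l)) \\
\mathtt{typecheck}(F) = \lfloor \Gamma \rfloor \quad \mathtt{analyze}(F) = \lfloor A \rfloor \quad \mathtt{regalloc}(F, A, \Gamma) = \lfloor \Phi \rfloor
\end{array}}{\mathcal{S}(\Sigma, F.\mathtt{code}, \sigma, l, R, M) \sim \mathcal{S}(\Sigma', F', \sigma, l, L, M)}$$

$$\frac{[\![Fd]\!] = \lfloor Fd' \rfloor \quad \Sigma \sim \Sigma'}{\mathcal{C}(\Sigma, Fd, \vec{v}, M) \sim \mathcal{C}(\Sigma', Fd', \vec{v}, M)} \qquad \frac{\Sigma \sim \Sigma'}{\mathcal{R}(\Sigma, v, M) \sim \mathcal{R}(\Sigma', v, M)}$$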

The invariant relating frames in the call stacks is similar to that relating regular states,
with a universal quantification on the return value, as in section 7.2.3. The proof of the
lock-step simulation diagram makes heavy use of the definition of the transfer function
$T$ for liveness analysis, combined with the following characterization of the register
allocation $\Phi$ with respect to the results $A$ of liveness analysis:

- For a move instruction $l : \mathtt{op}(\mathtt{move}, r_s, r_d, l')$, we have $\Phi(r_d) \ne \Phi(r')$ for all
  $r' \in A(l) \setminus \{r_s, r_d\}$.

- For other op, load or call instructions at $l$ with destination register $r$, we have
  $\Phi(r) \ne \Phi(r')$ for all $r' \in A(l) \setminus \{r\}$.

- For call instructions $l : \mathtt{call}(sig, (r_f \mid id), \vec{r}, r_d, l')$, we additionally have $\Phi(r') \ne r_m$
  for all $r' \in A(l) \setminus \{r_d\}$ and all caller-save registers $r_m$, consistent with the
  interference edges added for calls in section 8.2.3.

9 Branch tunneling and no-op elimination 

Register coalescing and dead-code elimination, performed in the course of register al- 
location, generate a number of nop instructions. Now is a good time to "short-circuit" 
them, rendering them unreachable and making them candidates for removal during 
CFG linearization (section 10). Since nop in a CFG representation also encodes uncon- 
ditional branches, this transformation also performs branch tunneling: the elimination 
of branches to branches. 

9.1 Code transformation 

Branch tunneling rewrites the LTL control-flow graph, replacing each successor, i.e. the
$l'$ and $l''$ in instructions such as $\mathtt{op}(op, \vec{\ell}, \ell, l')$ or $\mathtt{cond}(cond, \vec{\ell}, l', l'')$, by its effective
destination $D_F(l')$. Naively, the effective destination is computed by chasing down
sequences of nop instructions, stopping at the first non-nop instruction:

$$D_F(l) = \begin{cases}
D_F(l') & \text{if } F.\mathtt{code}(l) = \lfloor \mathtt{nop}(l') \rfloor; \\
l & \text{otherwise.}
\end{cases}$$

This is not a proper definition: if the control-flow graph contains cycles consisting only
of nop instructions, such as $l : \mathtt{nop}(l)$, the computation of $D_F(l)$ fails to terminate.

A simple solution is to bound the recursion depth when computing $D_F(l)$, returning
$l$ when the counter reaches 0. The initial value $N$ can be chosen at will; the number
of instructions in the code of function $F$ is a good choice.

A more elegant solution, suggested by an anonymous reviewer, uses a union-find
data structure $U_F$. A first scan of the control-flow graph populates $U_F$ by adding an
edge from $l$ to $l'$ for each instruction $l : \mathtt{nop}(l')$, provided $l$ and $l'$ are not already in
the same equivalence class. The effective destination $D_F(l)$, then, is defined as the
canonical representative of $l$ in the union-find structure $U_F$.
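
The union-find construction can be sketched in Python as follows. The tuple encoding of instructions and the helper names are hypothetical; the sketch relies on the fact that each nop is scanned exactly once, so a node is still its own representative when its edge is added.

```python
def effective_destinations(code):
    """Compute D_F with a union-find structure populated from the nop instructions."""
    parent = {l: l for l in code}

    def find(l):
        while parent[l] != l:
            parent[l] = parent[parent[l]]    # path halving
            l = parent[l]
        return l

    for l, ins in code.items():
        if ins[0] == "nop":
            target = find(ins[1])
            if target != find(l):            # skip edges that would close a nop cycle
                parent[l] = target
    return {l: find(l) for l in code}


def tunnel(code):
    """Replace every successor label by its effective destination."""
    D = effective_destinations(code)

    def rewrite(ins):
        if ins[0] in ("nop", "op"):          # ("nop", succ) or ("op", name, succ)
            return ins[:-1] + (D[ins[-1]],)
        if ins[0] == "cond":                 # ("cond", name, if_true, if_false)
            return ins[:-2] + (D[ins[-2]], D[ins[-1]])
        return ins

    return {l: rewrite(ins) for l, ins in code.items()}


code = {
    1: ("op", "a", 2),
    2: ("nop", 3),
    3: ("nop", 4),
    4: ("cond", "c", 2, 5),
    5: ("nop", 5),        # self-loop: stays its own destination
    6: ("nop", 7),
    7: ("nop", 6),        # two-node nop cycle
}
D = effective_destinations(code)
tunneled = tunnel(code)
```

The computation terminates on the nop cycles at points 5, 6 and 7, which is where the naive recursive definition fails.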



9.2 Semantic preservation 

Semantic preservation for branch tunneling follows from a simulation diagram of the
"option" kind (see figure 4). Intuitively, the execution of a non-call, non-return
instruction that causes a transition from point $l_1$ to point $l_2$ in the original code corresponds
to the execution of zero or one instructions in the tunneled code, from point $D_F(l_1)$
to point $D_F(l_2)$. The "zero" case can appear for example when the instruction at $l_1$ is
$\mathtt{nop}(l_2)$. The definition of the invariant between execution states is therefore

$$\frac{[\![F]\!] = \lfloor F' \rfloor \quad \Sigma \sim \Sigma'}{\mathcal{S}(\Sigma, F, \sigma, l, L, M) \sim \mathcal{S}(\Sigma', F', \sigma, D_F(l), L, M)}$$

for regular states, with the obvious definitions for call states, return states, and stack
frames. For simulation diagrams of the "option" kind, we must provide a measure to
show that it is not possible to take infinitely many "or zero" cases. For branch tunneling,
a suitable measure is the number of nop instructions that are skipped starting with the
current program point:

$$|\mathcal{S}(\Sigma, F, \sigma, l, L, M)| = \#\mathtt{nop}_F(l)$$

$$\#\mathtt{nop}_F(l) = \begin{cases}
1 + \#\mathtt{nop}_F(l') & \text{if } F.\mathtt{code}(l) = \lfloor \mathtt{nop}(l') \rfloor \text{ and } D_F(l) \ne l; \\
0 & \text{otherwise.}
\end{cases}$$

This definition is well founded because of the way $D_F$ is constructed from the
union-find structure $U_F$. The crucial property of the $D$ and $\#\mathtt{nop}$ functions is that, for all
program points $l$,

$$D_F(l) = l \ \lor\ \exists l',\ F.\mathtt{code}(l) = \lfloor \mathtt{nop}(l') \rfloor \land D_F(l') = D_F(l) \land \#\mathtt{nop}_F(l') < \#\mathtt{nop}_F(l)$$

Theorem 6 If $S_1 \sim S_2$ and $G \vdash S_1 \xrightarrow{t} S_1'$, either there exists $S_2'$ such that
$G' \vdash S_2 \xrightarrow{t} S_2'$ and $S_1' \sim S_2'$, or $|S_1'| < |S_1|$ and $t = \epsilon$ and $S_1' \sim S_2$.

10 Linearization of the control-flow graph

The next compilation step linearizes the control-flow graphs of LTL, replacing them
by lists of instructions with explicit labels and unconditional and conditional branches
to labels, in the style of assembly code. While the CFG representation of control is
very convenient for performing dataflow analyses, the linearized representation makes
it easier to insert new instructions, as needed by some of the subsequent passes.

Discussions of linearization in the literature focus on trace picking heuristics that 
reduce the number of jumps introduced, but consider the actual production of linearized 
code trivial. Our first attempts at proving directly the correctness of a trace picking 
algorithm that builds linearized code and performs branch tunneling on the fly showed 
that this is not so trivial. We therefore perform tunneling in a separate pass (see 
section 9), then implement CFG linearization in a way that clearly separates heuristics 
from the actual production of linearized code. 



10.1 The target language: LTLin 

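The target language for CFG linearization is LTLin, a variant of LTL where control-flow
graphs are replaced by lists of instructions.

$$\begin{array}{lrll}
\text{LTLin instructions:} & i ::= & \mathtt{op}(op, \vec{\ell}, \ell) & \text{arithmetic operation} \\
 & \mid & \mathtt{load}(\kappa, mode, \vec{\ell}, \ell) & \text{memory load} \\
 & \mid & \mathtt{store}(\kappa, mode, \vec{\ell}, \ell) & \text{memory store} \\
 & \mid & \mathtt{call}(sig, (\ell \mid id), \vec{\ell}, \ell) & \text{function call} \\
 & \mid & \mathtt{tailcall}(sig, (\ell \mid id), \vec{\ell}) & \text{function tail call} \\
 & \mid & \mathtt{cond}(cond, \vec{\ell}, l_{true}) & \text{conditional branch} \\
 & \mid & \mathtt{goto}(l) & \text{unconditional branch} \\
 & \mid & \mathtt{label}(l) & \text{definition of the label } l \\
 & \mid & \mathtt{return} \mid \mathtt{return}(\ell) & \text{function return} \\
\text{LTLin code sequences:} & c ::= & i_1 \ldots i_n & \text{list of instructions} \\
\text{LTLin functions:} & F ::= & \{\ \mathtt{sig} = sig; & \\
 & & \ \ \mathtt{params} = \vec{\ell}; & \text{parameters} \\
 & & \ \ \mathtt{stacksize} = n; & \text{size of stack data block} \\
 & & \ \ \mathtt{code} = c\ \} & \text{instructions}
\end{array}$$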



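The dynamic semantics of LTLin is similar to that of LTL as far as the handling
of data is concerned. In execution states, program points within a function $F$ are no
longer represented by CFG labels $l$, but instead are represented by code sequences $c$
that are suffixes of $F.\mathtt{code}$. The first element of $c$ is the instruction to execute next. As
shown in figure 13, most instructions have "fall-through" behavior: they transition from
$i.c$ to $c$. To resolve branches to a label $l$, we use the auxiliary function $\mathtt{findlabel}(F, l)$
that returns the maximal suffix of $F.\mathtt{code}$ that starts with $\mathtt{label}(l)$, if it exists.

$$\frac{\mathtt{eval\_op}(G, \sigma, op, L(\vec{\ell})) = \lfloor v \rfloor}{G \vdash \mathcal{S}(\Sigma, F, \sigma, \mathtt{op}(op, \vec{\ell}, \ell).c, L, M) \xrightarrow{\epsilon} \mathcal{S}(\Sigma, F, \sigma, c, L\{\ell \leftarrow v\}, M)}$$

$$\frac{\mathtt{findlabel}(F, l) = \lfloor c' \rfloor}{G \vdash \mathcal{S}(\Sigma, F, \sigma, \mathtt{goto}(l).c, L, M) \xrightarrow{\epsilon} \mathcal{S}(\Sigma, F, \sigma, c', L, M)}$$

$$\frac{\mathtt{eval\_cond}(cond, L(\vec{\ell})) = \lfloor \mathtt{true} \rfloor \quad \mathtt{findlabel}(F, l_{true}) = \lfloor c' \rfloor}{G \vdash \mathcal{S}(\Sigma, F, \sigma, \mathtt{cond}(cond, \vec{\ell}, l_{true}).c, L, M) \xrightarrow{\epsilon} \mathcal{S}(\Sigma, F, \sigma, c', L, M)}$$

$$\frac{\mathtt{eval\_cond}(cond, L(\vec{\ell})) = \lfloor \mathtt{false} \rfloor}{G \vdash \mathcal{S}(\Sigma, F, \sigma, \mathtt{cond}(cond, \vec{\ell}, l_{true}).c, L, M) \xrightarrow{\epsilon} \mathcal{S}(\Sigma, F, \sigma, c, L, M)}$$

Fig. 13 Semantics of LTLin (selected rules).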
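
The rules of figure 13 can be animated by a small Python interpreter. This is a deliberately reduced sketch: operators and conditions are passed as Python functions, the location map is a plain dictionary without overlap handling, there is no memory or call stack, and the `fuel` bound is an addition of this example.

```python
def findlabel(code, l):
    """Longest suffix of code starting with label(l), or None if the label is absent."""
    for k, ins in enumerate(code):
        if ins == ("label", l):
            return code[k:]
    return None


def run(fcode, L, fuel=1000):
    """Execute LTLin-like code from its first instruction; returns the returned value."""
    c = fcode
    while c and fuel > 0:
        fuel -= 1
        ins, rest = c[0], c[1:]
        if ins[0] == "op":                      # ("op", f, args, dst): fall through
            _, f, args, dst = ins
            L = {**L, dst: f(*[L[a] for a in args])}
            c = rest
        elif ins[0] == "goto":                  # ("goto", l)
            c = findlabel(fcode, ins[1])
        elif ins[0] == "cond":                  # ("cond", pred, args, l_true)
            _, pred, args, l_true = ins
            c = findlabel(fcode, l_true) if pred(*[L[a] for a in args]) else rest
        elif ins[0] == "return":                # ("return", loc)
            return L[ins[1]]
        else:                                   # ("label", l): skip
            c = rest
    return None                                 # stuck, missing label, or out of fuel


loop = [
    ("label", 1),
    ("cond", lambda n: n == 0, ["n"], 3),
    ("op", lambda s, n: s + n, ["s", "n"], "s"),
    ("op", lambda n: n - 1, ["n"], "n"),
    ("goto", 1),
    ("label", 3),
    ("return", "s"),
]
```

Running `run(loop, {"n": 4, "s": 0})` sums 4 + 3 + 2 + 1 and returns 10.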
10.2 Code transformation 

CFG linearization is performed in two steps that clearly separate the heuristic,
correctness-irrelevant part of linearization from the actual, correctness-critical code
generation part [1, chap. 8]. We first compute an enumeration $l_1 \ldots l_n$ of the labels of
the CFG nodes reachable from the entry node. The order of labels in this enumeration
dictates the positions of the corresponding instructions in the list of LTLin instructions
that we will generate. Following the verified validator approach, this enumeration of
CFG nodes is computed by untrusted code written in Caml. It can implement any of
the textbook heuristics for picking "hot" traces that should execute without branches,
including static branch prediction. This enumeration is validated by a validator
written in Coq that checks the following two conditions:

1. No node $l$ appears twice in the enumeration.

2. All nodes $l$ reachable from the function entry point appear in the enumeration.

For condition 2, the validator precomputes the set of reachable nodes using a trivial
forward dataflow analysis, where the abstract domain is {false, true} and the transfer
function is $T(l, a) = a$. (By definition, all successors of a reachable node are reachable.)
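
The two conditions can be checked by a few lines of Python. In this sketch the set of reachable nodes is computed by a plain graph traversal rather than by the generic dataflow solver; the function name and the example graph are assumptions.

```python
def validate_enumeration(enum, entry, successors):
    """Check that enum has no duplicates and covers every node reachable from entry."""
    if len(set(enum)) != len(enum):
        return False                      # condition 1 violated
    reachable, stack = {entry}, [entry]
    while stack:
        l = stack.pop()
        for s in successors(l):
            if s not in reachable:
                reachable.add(s)
                stack.append(s)
    return reachable <= set(enum)         # condition 2


succ = {1: [2, 3], 2: [4], 3: [4], 4: [], 5: [4]}   # node 5 is unreachable from 1
```

An enumeration may list unreachable nodes such as 5 or omit them; both choices satisfy the two conditions, whereas omitting a reachable node or repeating a node is rejected.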

To generate LTLin code, we then concatenate the instructions of the control-flow 
graph in the order given by the enumeration li,. . . ,ln- Each instruction is preceded 
by label(Zi) and followed by a goto to its successor, unless this goto is unnecessary 
because it would branch to an immediately following label. Formally, the basic code 
generation function is of the form C(i,c) and returns a sequence c' of instructions 
obtained by prepending the translation of the LTL instruction i to the initial instruction 
sequence c. Here are some representative cases: 

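\[
\begin{array}{rcll}
C(\mathtt{op}(op, \vec{\ell}, \ell, l'), c) & = & \mathtt{op}(op, \vec{\ell}, \ell).c & \text{if $c$ starts with } \mathtt{label}(l'); \\
C(\mathtt{op}(op, \vec{\ell}, \ell, l'), c) & = & \mathtt{op}(op, \vec{\ell}, \ell).\mathtt{goto}(l').c & \text{otherwise.} \\
C(\mathtt{cond}(cond, \vec{\ell}, l_{true}, l_{false}), c) & = & \mathtt{cond}(cond, \vec{\ell}, l_{true}).c & \text{if $c$ starts with } \mathtt{label}(l_{false}); \\
C(\mathtt{cond}(cond, \vec{\ell}, l_{true}, l_{false}), c) & = & \mathtt{cond}(\neg cond, \vec{\ell}, l_{false}).c & \text{if $c$ starts with } \mathtt{label}(l_{true}); \\
C(\mathtt{cond}(cond, \vec{\ell}, l_{true}, l_{false}), c) & = & \mathtt{cond}(cond, \vec{\ell}, l_{true}).\mathtt{goto}(l_{false}).c & \text{otherwise.}
\end{array}
\]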
This function is then iterated over the enumeration $\vec{l}$ of CFG nodes, inserting the
appropriate label instructions:

\[
\begin{array}{rcll}
C(F, \epsilon) & = & \epsilon & \\
C(F, l.\vec{l}) & = & \mathtt{label}(l).C(i, C(F, \vec{l})) & \text{if } F.\mathtt{code}(l) = \lfloor i \rfloor
\end{array}
\]
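
The following Python sketch mirrors the two definitions above on a toy encoding. The tuple representation of instructions and the names `emit` and `linearize` are assumptions of this sketch, and operands are omitted since they play no role in the placement of labels and gotos.

```python
def starts_with_label(code, l):
    """True if the instruction sequence begins with label(l)."""
    return bool(code) and code[0] == ("label", l)

def emit(instr, code):
    """C(i, c): prepend the translation of LTL instruction i to c."""
    kind = instr[0]
    if kind == "op":
        _, name, succ = instr
        if starts_with_label(code, succ):
            return [("op", name)] + code              # fall through
        return [("op", name), ("goto", succ)] + code
    if kind == "cond":
        _, test, l_true, l_false = instr
        if starts_with_label(code, l_false):
            return [("cond", test, l_true)] + code
        if starts_with_label(code, l_true):
            # Negate the condition so that the true branch falls through.
            return [("cond", ("not", test), l_false)] + code
        return [("cond", test, l_true), ("goto", l_false)] + code
    if kind == "return":
        return [("return",)] + code
    raise ValueError(f"unsupported instruction: {kind}")

def linearize(cfg, enumeration):
    """C(F, l_1 ... l_n): process the enumeration from its last label
    back to its first, as in the recursive definition."""
    code = []
    for l in reversed(enumeration):
        code = [("label", l)] + emit(cfg[l], code)
    return code
```

For a diamond-shaped CFG, the enumeration decides which branch falls through and which one needs an explicit goto; the generated code is correct either way, which is what the validator-based design relies on.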



10.3 Semantic preservation 

Each intra-function transition in the original LTL code corresponds to 2 or 3 transitions 
in the generated LTLin code: one to skip the label instruction, one to execute the 
actual instruction, and possibly one to perform the goto to the successor. The proof of 
semantic preservation is therefore based on a simulation diagram of the "plus" kind. 
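The main invariant is: whenever the LTL program is at program point $l$ in function $F$,
the LTLin program is at the instruction sequence findlabel($F'$, $l$) in the translation
$F'$ of $F$.

\[
\frac{[\![F]\!] = \lfloor F' \rfloor \qquad \Sigma \sim \Sigma' \qquad \mathtt{findlabel}(F', l) = \lfloor c \rfloor \qquad l \text{ is reachable from } F.\mathtt{entrypoint}}
     {\mathcal{S}(\Sigma, F, \sigma, l, L, M) \sim \mathcal{S}(\Sigma', F', \sigma, c, L, M)}
\]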

A pleasant surprise is that the simulation proof goes through under very weak as- 
sumptions about the enumeration of CFG nodes produced by the external heuristics: 
conditions (1) and (2) of section 10.2 are all it takes to guarantee semantic preserva- 
tion. This shows that our presentation of linearization is robust: many trace picking 
heuristics can be tried without having to redo any of the semantic preservation proofs. 



11 Spilling, reloading, and materialization of calling conventions 

The next compilation pass finishes the register allocation process described in section 8
by inserting explicit "spill" and "reload" operations around uses of pseudo-registers
that have been allocated stack slots. Additionally, calling conventions are materialized
in the generated code by inserting moves to and from the conventional locations used
for parameter passing around function calls.



11.1 The target language: Linear 

Linear, the target language for this pass, is a variant of LTLin where the operands
of arithmetic operations, memory accesses and conditional branches are restricted to
machine registers instead of arbitrary locations. This is consistent with the RISC
instruction set of our target processor. (Machine registers, written $r_m$ so far, will now
be written $r$ for simplicity.) Two instructions getstack and setstack are provided to
move data between machine registers and stack slots $s$.
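\[
\begin{array}{lrcll}
\text{Linear instructions:} & i & ::= & \mathtt{getstack}(s, r) \mid \mathtt{setstack}(r, s) & \text{reading, writing a stack slot} \\
 & & \mid & \mathtt{op}(op, \vec{r}, r) & \text{arithmetic operation} \\
 & & \mid & \mathtt{load}(\kappa, mode, \vec{r}, r) & \text{memory load} \\
 & & \mid & \mathtt{store}(\kappa, mode, \vec{r}, r) & \text{memory store} \\
 & & \mid & \mathtt{call}(sig, (r \mid id)) & \text{function call} \\
 & & \mid & \mathtt{tailcall}(sig, (r \mid id)) & \text{function tail call} \\
 & & \mid & \mathtt{cond}(cond, \vec{r}, l_{true}) & \text{conditional branch} \\
 & & \mid & \mathtt{goto}(l) & \text{unconditional branch} \\
 & & \mid & \mathtt{label}(l) & \text{definition of the label } l \\
 & & \mid & \mathtt{return} & \text{function return} \\[1ex]
\text{Linear code sequences:} & c & ::= & i_1 \ldots i_n & \text{list of instructions} \\[1ex]
\text{Linear functions:} & F & ::= & \{\ \mathtt{sig} = sig; & \\
 & & & \ \ \mathtt{stacksize} = n; & \text{size of stack data block} \\
 & & & \ \ \mathtt{code} = c\ \} & \text{instructions}
\end{array}
\]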



Another novelty of Linear is that call, tailcall and return instructions, as well as 
function definitions, no longer carry a list of locations for their parameters or results: the 
generated Linear code contains all the necessary move instructions to ensure that these 
parameters and results reside in the locations determined by the calling conventions. 
Correspondingly, call states and return states no longer carry lists of values: instead, 
they carry full location maps L where the values of arguments and results can be 
found at conventional locations, determined as a function of the signature of the called 
function. 

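\[
\begin{array}{lrcll}
\text{Program states:} & S & ::= & \mathcal{S}(\Sigma, F, \sigma, c, L, M) & \text{regular state} \\
 & & \mid & \mathcal{C}(\Sigma, Fd, L, M) & \text{call state} \\
 & & \mid & \mathcal{R}(\Sigma, L, M) & \text{return state} \\[1ex]
\text{Call stacks:} & \Sigma & ::= & (\mathcal{F}(F, \sigma, c, L))^* & \text{list of frames}
\end{array}
\]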



As shown in the dynamic semantics for Linear (see figure 14), the behavior of
locations across function calls is specified by two functions: entryfun($L$) determines
the locations on entrance to the callee as a function of the locations $L$ before the call,
and exitfun($L$, $L'$) determines the locations in the caller when the call returns as a
function of the caller's locations before the call, $L$, and the callee's locations before
the return, $L'$. In summary, processor registers are global, but some are preserved by
the callee; local and incoming slots of the caller are preserved across the call; and the
incoming slots on entrance to the callee are the outgoing slots of the caller.



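\[
\begin{array}{lll}
\text{Location } l & \mathtt{entryfun}(L)(l) & \mathtt{exitfun}(L, L')(l) \\
\hline
r & L(r) & L(r) \text{ if $r$ is callee-save} \\
  &      & L'(r) \text{ if $r$ is caller-save} \\
\mathtt{local}(\tau, \delta) & \mathtt{undef} & L(\mathtt{local}(\tau, \delta)) \\
\mathtt{incoming}(\tau, \delta) & L(\mathtt{outgoing}(\tau, \delta)) & L(\mathtt{incoming}(\tau, \delta)) \\
\mathtt{outgoing}(\tau, \delta) & \mathtt{undef} & L'(\mathtt{incoming}(\tau, \delta))
\end{array}
\]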
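
The table can be read as two functions on location maps. The Python sketch below is a toy model under stated assumptions: location maps are dictionaries in which missing entries count as undef, locations are tuples, and the `CALLEE_SAVE` set is a hypothetical placeholder rather than the actual PowerPC convention.

```python
UNDEF = "undef"
CALLEE_SAVE = {"R13", "R14"}   # hypothetical callee-save registers

def lookup(L, loc):
    """Value of location loc in map L; unmapped locations are undef."""
    return L.get(loc, UNDEF)

def entryfun(L):
    """Locations on entrance to the callee, given the caller's map L."""
    def at(loc):
        kind = loc[0]
        if kind == "reg":
            return lookup(L, loc)                      # registers are global
        if kind == "incoming":
            return lookup(L, ("outgoing",) + loc[1:])  # caller's outgoing slot
        return UNDEF                                   # local and outgoing slots
    return at

def exitfun(L, L_callee):
    """Locations in the caller after the call returns, given the caller's
    map L before the call and the callee's map L_callee before the return."""
    def at(loc):
        kind = loc[0]
        if kind == "reg":
            source = L if loc[1] in CALLEE_SAVE else L_callee
            return lookup(source, loc)
        if kind in ("local", "incoming"):
            return lookup(L, loc)                      # preserved across the call
        return lookup(L_callee, ("incoming",) + loc[1:])  # outgoing slots
    return at
```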
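\[
\frac{\mathtt{eval\_op}(G, \sigma, op, L(\vec{r})) = \lfloor v \rfloor}
     {G \vdash \mathcal{S}(\Sigma, F, \sigma, \mathtt{op}(op, \vec{r}, r).c, L, M) \xrightarrow{\epsilon} \mathcal{S}(\Sigma, F, \sigma, c, L\{r \leftarrow v\}, M)}
\]
\[
\frac{L(r) = \mathtt{ptr}(b, 0) \qquad \mathtt{funct}(G, b) = \lfloor Fd \rfloor \qquad Fd.\mathtt{sig} = sig}
     {G \vdash \mathcal{S}(\Sigma, F, \sigma, \mathtt{call}(sig, r).c, L, M) \xrightarrow{\epsilon} \mathcal{C}(\mathcal{F}(F, \sigma, c, L).\Sigma, Fd, L, M)}
\]
\[
\frac{L(r) = \mathtt{ptr}(b, 0) \qquad \mathtt{funct}(G, b) = \lfloor Fd \rfloor \qquad Fd.\mathtt{sig} = sig \qquad L' = \mathtt{exitfun}(\Sigma.\mathtt{top}.L, L)}
     {G \vdash \mathcal{S}(\Sigma, F, \sigma, \mathtt{tailcall}(sig, r).c, L, M) \xrightarrow{\epsilon} \mathcal{C}(\Sigma, Fd, L', M)}
\]
\[
\frac{L' = \mathtt{exitfun}(\Sigma.\mathtt{top}.L, L)}
     {G \vdash \mathcal{S}(\Sigma, F, \sigma, \mathtt{return}.c, L, M) \xrightarrow{\epsilon} \mathcal{R}(\Sigma, L', \mathtt{free}(M, \sigma))}
\]
\[
\frac{\mathtt{alloc}(M, 0, F.\mathtt{stacksize}) = (\sigma, M') \qquad L' = \mathtt{entryfun}(L)}
     {G \vdash \mathcal{C}(\Sigma, \mathtt{internal}(F), L, M) \xrightarrow{\epsilon} \mathcal{S}(\Sigma, F, \sigma, F.\mathtt{code}, L', M')}
\]
\[
\frac{\vdash Fe(\vec{v}) \stackrel{t}{\Rightarrow} v \qquad \vec{v} = L(\mathtt{loc\_arguments}(Fe.\mathtt{sig})) \qquad L' = L\{\mathtt{loc\_result}(Fe.\mathtt{sig}) \leftarrow v\}}
     {G \vdash \mathcal{C}(\Sigma, \mathtt{external}(Fe), L, M) \xrightarrow{t} \mathcal{R}(\Sigma, L', M)}
\]
\[
G \vdash \mathcal{R}(\mathcal{F}(F, \sigma, c, L_0).\Sigma, L, M) \xrightarrow{\epsilon} \mathcal{S}(\Sigma, F, \sigma, c, L, M)
\]

Fig. 14 Semantics of Linear. The transitions not shown are similar to those of LTLin.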

In other words, the entryfun and exitfun functions anticipate, at the level of the Linear
semantics, the effect of future transformations (placement of stack slots in memory and
insertion of function prologues and epilogues to save and restore callee-save registers)
performed in the next compilation pass (section 12).

In the rules for tailcall and return of figure 14, the notation $\Sigma.\mathtt{top}.L$ stands
for the $L$ component of the top frame in stack $\Sigma$. A suitable default is defined for an
empty stack.

\[
(\mathcal{F}(F, \sigma, c, L).\Sigma).\mathtt{top}.L = L \qquad\qquad \epsilon.\mathtt{top}.L = (\ell \mapsto \mathtt{undef})
\]

11.2 Code transformation

Our strategy for spilling and reloading is simplistic: each use of a spilled pseudo-register
is preceded by a getstack instruction to reload the pseudo-register in a machine
register, and each definition is followed by a setstack instruction that spills the result.
No attempt is made to reuse a reloaded value, nor to delay spilling. We reserve 3
integer registers and 3 float registers to hold reloaded values and results of instructions
before spilling. (This does not follow compiler textbooks, which prescribe re-running
register allocation to assign registers to reload and spill temporaries. However, it is
difficult to prove termination for this practice, and moreover it requires semantic
reasoning about partially-allocated code.) The following case should give the flavor of the
transformation:

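\[
\begin{array}{rcl}
[\![\mathtt{op}(op, \vec{\ell}, \ell).c]\!] & = & \text{let } \vec{r} = \mathtt{regs\_for}(\vec{\ell}) \text{ and } r = \mathtt{reg\_for}(\ell) \text{ in} \\
 & & \mathtt{reloads}(\vec{\ell}, \vec{r}).\mathtt{op}(op, \vec{r}, r).\mathtt{spill}(r, \ell).[\![c]\!]
\end{array}
\]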

reg_for($\ell$) returns $r$ if the location $\ell$ is a machine register $r$, or a temporary register of
the appropriate type if $\ell$ is a stack slot. regs_for($\vec{\ell}$) does the same for a list of locations,
using different temporary registers for each location. spill($r$, $\ell$) generates the move or
setstack operation that sets $\ell$ to the value of register $r$. Symmetrically, reloads($\vec{\ell}$, $\vec{r}$)
generates the move or getstack operations that set $\vec{r}$ to the values of locations $\vec{\ell}$.
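
To make the shape of the generated code concrete, here is a small Python sketch of the case for arithmetic operations. It is a simplification under several assumptions: all operands are treated as integers, the temporary register names in `INT_TEMPS` are hypothetical, and the helper names (`reload`, `spill`, `transl_op`) are chosen for this sketch only.

```python
INT_TEMPS = ["IT1", "IT2", "IT3"]   # hypothetical reserved temporaries

def is_reg(loc):
    return loc[0] == "reg"

def reload(src, r):
    """Instructions setting register r to the value of location src."""
    if src == r:
        return []
    if is_reg(src):
        return [("move", src, r)]
    return [("getstack", src, r)]

def spill(r, dst):
    """Instructions setting location dst to the value of register r."""
    if r == dst:
        return []
    if is_reg(dst):
        return [("move", r, dst)]
    return [("setstack", r, dst)]

def transl_op(op, args, res):
    """Translate op(op, args, res) with arbitrary locations into Linear
    code whose operands are all machine registers."""
    temps = iter(INT_TEMPS)
    # regs_for: keep registers, give each stack slot its own temporary.
    arg_regs = [a if is_reg(a) else ("reg", next(temps)) for a in args]
    # reg_for: the result may reuse a temporary, since all arguments are
    # read before the result is written.
    res_reg = res if is_reg(res) else ("reg", INT_TEMPS[0])
    code = []
    for a, r in zip(args, arg_regs):
        code += reload(a, r)
    code.append(("op", op, arg_regs, res_reg))
    return code + spill(res_reg, res)
```

When every operand is already a machine register, the translation degenerates to the original instruction, with no reload or spill emitted.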

For call and tailcall instructions with signature $sig$ and arguments $\vec{\ell}$, we insert
moves from $\vec{\ell}$ to the locations dictated by the calling conventions. These locations (a
mixture of processor registers and outgoing stack slots) are determined as a function
loc_arguments($sig$) of the signature $sig$ of the called function. Likewise, the result of
the call, which is passed back in the conventional location loc_result($sig$), is moved
to the result location of the call.

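\[
\begin{array}{rcl}
[\![\mathtt{call}(sig, id, \vec{\ell}, \ell).c]\!] & = & \mathtt{parallel\_move}(\vec{\ell}, \mathtt{loc\_arguments}(sig)). \\
 & & \mathtt{call}(sig, id).\mathtt{spill}(\mathtt{loc\_result}(sig), \ell).[\![c]\!]
\end{array}
\]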

Symmetrically, at entry to a function $F$, we insert moves from loc_parameters($F$.sig)
to $F$.params. (loc_parameters($sig$) is loc_arguments($sig$) where outgoing slots are
replaced by incoming slots.)

The moves inserted for function arguments and for function parameters must
implement a parallel assignment semantics: some registers can appear both as sources
and destinations, as in $(r_1, r_2, r_3) := (r_2, r_1, r_4)$. It is folklore that such parallel moves
can be implemented by a sequence of elementary moves using at most one temporary
register of each kind. Formulating the parallel move algorithm in Coq and proving its
correctness was a particularly difficult part of this development. The proof is detailed
in a separate paper [85].
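
The folklore construction can be sketched as follows. This Python code is not the verified algorithm of [85]; it is a simple worklist formulation written for illustration, it assumes pairwise distinct destinations, and it uses a single temporary `tmp` without distinguishing integer and float registers. The names `sequentialize` and `run` are hypothetical.

```python
def sequentialize(moves, tmp):
    """Turn a parallel assignment, given as (source, destination) pairs
    with pairwise distinct destinations, into a list of elementary moves
    that uses tmp as the only scratch location."""
    pending = {dst: src for src, dst in moves if src != dst}
    out = []
    while pending:
        sources = set(pending.values())
        # A destination that no pending move still reads can be overwritten.
        ready = [dst for dst in pending if dst not in sources]
        if ready:
            dst = ready[0]
            out.append((pending.pop(dst), dst))
        else:
            # Only cycles remain: save one location in tmp and redirect
            # the moves that read it, which opens the cycle.
            victim = next(iter(pending))
            out.append((victim, tmp))
            for dst in list(pending):
                if pending[dst] == victim:
                    pending[dst] = tmp
    return out

def run(moves, state):
    """Execute elementary moves sequentially on a copy of state."""
    state = dict(state)
    for src, dst in moves:
        state[dst] = state[src]
    return state
```

On the example $(r_1, r_2, r_3) := (r_2, r_1, r_4)$, the sketch first performs the move into $r_3$, which no other move reads, then breaks the swap between $r_1$ and $r_2$ through the temporary.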

11.3 Semantic preservation 

The correctness proof for the spilling pass is surprisingly involved because it needs to 
account for the fact that the location states L and memory states M differ significantly 
between the original LTLin code and the generated Linear code. An inessential source 
of difference is that the Linear code makes use of temporary registers and outgoing 
and incoming stack locations while the LTLin code does not. A deeper difference comes 
from the fact that, in LTLin, locations other than function parameters are consistently 
initialized to the undef value, while in Linear some of these locations (e.g. processor 
registers) just keep whatever values they had in the caller before the call instruction. 
Performing arithmetic over this undef value would cause the original LTLin program 
to go wrong, but it can still pass this value around, store it in memory locations, and 
read it back. Therefore, some undef values found in LTLin register and memory states 
can become any other value in the corresponding Linear location and memory states. 
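To capture this fact, we use the "less defined than" ordering $\le$ between values defined
by

\[
v \le v' \ \stackrel{\mathrm{def}}{=}\ v = v' \text{ or } v = \mathtt{undef}
\]

and extended to memory states as follows:

\[
M \le M' \ \stackrel{\mathrm{def}}{=}\ \forall \kappa, b, \delta, v,\ \mathtt{load}(\kappa, M, b, \delta) = \lfloor v \rfloor
  \implies \exists v',\ \mathtt{load}(\kappa, M', b, \delta) = \lfloor v' \rfloor \wedge v \le v'
\]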

Leroy and Blazy [59, section 5.3] study the properties of this relation between memory 
states. It commutes nicely with the store, alloc and free operations over memory 
states. 
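Putting it all together, we define agreement $L \le L'$ between a LTLin location state
$L$ and a Linear location state $L'$ as

\[
L \le L' \ \stackrel{\mathrm{def}}{=}\ L(\ell) \le L'(\ell) \text{ for all non-temporary registers or local stack slots } \ell
\]

and define the invariant relating LTLin and Linear execution states as follows.

\[
\frac{\Sigma \sim \Sigma' : F.\mathtt{sig} \qquad L \le L' \qquad M \le M'}
     {\mathcal{S}(\Sigma, F, \sigma, c, L, M) \sim \mathcal{S}(\Sigma', [\![F]\!], \sigma, [\![c]\!], L', M')}
\]
\[
\frac{\Sigma \sim \Sigma' : Fd.\mathtt{sig} \qquad \vec{v} \le L'(\mathtt{loc\_arguments}(Fd.\mathtt{sig})) \qquad L' \approx \Sigma'.\mathtt{top}.L \qquad M \le M'}
     {\mathcal{C}(\Sigma, Fd, \vec{v}, M) \sim \mathcal{C}(\Sigma', [\![Fd]\!], L', M')}
\]
\[
\frac{\Sigma \sim \Sigma' : sig \qquad v \le L'(\mathtt{loc\_result}(sig)) \qquad L' \approx \Sigma'.\mathtt{top}.L \qquad M \le M'}
     {\mathcal{R}(\Sigma, v, M) \sim \mathcal{R}(\Sigma', L', M')}
\]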

For call states and return states, the second premises capture the fact that the argument
and return values can indeed be found in the corresponding locations dictated by
the calling conventions, modulo the $\le$ relation between values. The third premises
$L' \approx \Sigma'.\mathtt{top}.L$ say that the current location state $L'$ and that of the caller $\Sigma'.\mathtt{top}.L$
assign the same values to any callee-save location. Finally, in the invariant $\Sigma \sim \Sigma' : sig$
relating call stacks, a call signature $sig$ is threaded through the call stack to make sure
that the caller and the callee agree on the result type of the call, and therefore on the
location used to pass the return value.

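\[
\frac{sig.\mathtt{res} = \lfloor \mathtt{int} \rfloor}
     {\epsilon \sim \epsilon : sig}
\]
\[
\frac{\Sigma \sim \Sigma' : F.\mathtt{sig} \qquad c' = \mathtt{spill}(\mathtt{loc\_result}(sig), \ell).[\![c]\!] \qquad \mathtt{postcall}(L) \le L'}
     {\mathcal{F}(\ell, F, \sigma, \mathtt{postcall}(L), c).\Sigma \sim \mathcal{F}([\![F]\!], \sigma, L', c').\Sigma' : sig}
\]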

Armed with these definitions and the proof of the parallel move algorithm of [85], we
prove semantic preservation for this pass using a simulation diagram of the "star" kind.
The only part of the transformation that could cause stuttering is the elimination of
a redundant move from $\ell$ to $\ell$. To prove that stuttering cannot happen, it suffices to
note that the length of the LTLin instruction sequence currently executing decreases in
this case.



12 Construction of the activation record 

The penultimate compilation pass lays out the activation record for each function, 
allocating space for stack slots and turning accesses to slots into memory loads and 
stores. Function prologues and epilogues are added to preserve the values of callee-save 
registers. 
12.1 The target language: Mach



The last intermediate language in our gentle descent towards assembly language is 
called Mach. It is a variant of Linear where the three infinite supplies of stack slots 
(local, incoming and outgoing) are mapped to actual memory locations in the stack 
frames of the current function (for local and outgoing slots) or the calling function (for 
incoming slots). 

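\[
\begin{array}{lrcll}
\text{Mach instructions:} & i & ::= & \mathtt{setstack}(r, \tau, \delta) & \text{register to stack move} \\
 & & \mid & \mathtt{getstack}(\tau, \delta, r) & \text{stack to register move} \\
 & & \mid & \mathtt{getparent}(\tau, \delta, r) & \text{caller's stack to register move} \\
 & & \mid & \ldots & \text{as in Linear}
\end{array}
\]

In the three new move instructions, $\tau$ is the type of the data moved and $\delta$ its word
offset in the corresponding activation record.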



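\[
\begin{array}{lrcll}
\text{Mach code:} & c & ::= & i_1 \ldots i_n & \text{list of instructions} \\[1ex]
\text{Mach functions:} & F & ::= & \{\ \mathtt{sig} = sig; & \\
 & & & \ \ \mathtt{stack\_high} = n; & \text{upper bound of stack data block} \\
 & & & \ \ \mathtt{stack\_low} = n; & \text{lower bound of stack data block} \\
 & & & \ \ \mathtt{retaddr} = \delta; & \text{offset of saved return address} \\
 & & & \ \ \mathtt{link} = \delta; & \text{offset of back link} \\
 & & & \ \ \mathtt{code} = c\ \} & \text{instructions}
\end{array}
\]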



Functions carry two byte offsets, retaddr and link, indicating where in the activation
record the function prologue should save the return address into its caller and the back
link to the activation record of its caller, respectively.



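\[
\begin{array}{lrcll}
\text{Program states:} & S & ::= & \mathcal{S}(\Sigma, F, \sigma, c, R, M) & \text{regular states} \\
 & & \mid & \mathcal{C}(\Sigma, Fd, R, M) & \text{call states} \\
 & & \mid & \mathcal{R}(\Sigma, R, M) & \text{return states} \\[1ex]
\text{Call stacks:} & \Sigma & ::= & (\mathcal{F}(F, \sigma, ra, c))^* & \text{list of frames}
\end{array}
\]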



Semantically, the main difference between Linear and Mach is that, in Mach, the 
register state R mapping processor registers to values is global and shared between 
caller and callee. In particular, R is not saved in the call stack, and as shown in figure 15, 
there is no automatic restoration of callee-save registers at function return; instead, the 
Mach code generator produces appropriate setstack and getstack instructions to save 
and restore used callee-save registers at function prologues and epilogues. 

The setstack and getstack instructions are interpreted as memory stores and
loads relative to the stack pointer. We write $\kappa(\tau)$ for the memory quantity appropriate
for storing a value of type $\tau$ without losing information, namely $\kappa(\mathtt{int}) = \mathtt{int32}$ and
$\kappa(\mathtt{float}) = \mathtt{float64}$. For the getparent instruction, we recover a pointer to the caller's
stack frame by loading from the link location of our stack frame, then load from this
pointer. This additional indirection is needed since, in our memory model, the callee's
stack frame is not necessarily adjacent to that of the caller. This linking of stack frames
is implemented by the rule for function entry.

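The rules for call and function entry also make provisions for saving a return
address within the caller's function code in the retaddr location of the activation
record. This return address (a pointer to an instruction within a code block) becomes
relevant in the next compilation pass (generation of assembly code), but it is convenient
to reflect it in the semantics of Mach. To this end, the semantics is parameterized
by a predicate retaddr($F$, $c$, $ra$) that relates this return address $ra$ with the caller's
function $F$ and the code sequence $c$ that immediately follows the call instruction.
In section 14.2, we will see how to define this predicate in a way that accurately
characterizes the return address. The rules for return and tailcall contain premises
that check that the values contained in the retaddr and link locations of the activation
record were preserved during the execution of the function.

\[
\mathtt{storestack}(\tau, M, \sigma, \delta, v) =
\begin{cases}
\mathtt{store}(\kappa(\tau), M, b, \delta' + \delta, v) & \text{if } \sigma = \mathtt{ptr}(b, \delta') \\
\emptyset & \text{otherwise}
\end{cases}
\]
\[
\mathtt{loadstack}(\tau, M, \sigma, \delta) =
\begin{cases}
\mathtt{load}(\kappa(\tau), M, b, \delta' + \delta) & \text{if } \sigma = \mathtt{ptr}(b, \delta') \\
\emptyset & \text{otherwise}
\end{cases}
\]
\[
\frac{\mathtt{storestack}(\tau, M, \sigma, \delta, R(r)) = \lfloor M' \rfloor}
     {G \vdash \mathcal{S}(\Sigma, F, \sigma, \mathtt{setstack}(r, \tau, \delta).c, R, M) \xrightarrow{\epsilon} \mathcal{S}(\Sigma, F, \sigma, c, R, M')}
\]
\[
\frac{\mathtt{loadstack}(\tau, M, \sigma, \delta) = \lfloor v \rfloor}
     {G \vdash \mathcal{S}(\Sigma, F, \sigma, \mathtt{getstack}(\tau, \delta, r).c, R, M) \xrightarrow{\epsilon} \mathcal{S}(\Sigma, F, \sigma, c, R\{r \leftarrow v\}, M)}
\]
\[
\frac{\mathtt{loadstack}(\mathtt{int}, M, \sigma, F.\mathtt{link}) = \lfloor \sigma' \rfloor \qquad \mathtt{loadstack}(\tau, M, \sigma', \delta) = \lfloor v \rfloor}
     {G \vdash \mathcal{S}(\Sigma, F, \sigma, \mathtt{getparent}(\tau, \delta, r).c, R, M) \xrightarrow{\epsilon} \mathcal{S}(\Sigma, F, \sigma, c, R\{r \leftarrow v\}, M)}
\]
\[
\frac{R(r) = \mathtt{ptr}(b, 0) \qquad \mathtt{funct}(G, b) = \lfloor Fd \rfloor \qquad Fd.\mathtt{sig} = sig \qquad \mathtt{retaddr}(F, c, ra)}
     {G \vdash \mathcal{S}(\Sigma, F, \sigma, \mathtt{call}(sig, r).c, R, M) \xrightarrow{\epsilon} \mathcal{C}(\mathcal{F}(F, \sigma, ra, c).\Sigma, Fd, R, M)}
\]
\[
\frac{\begin{array}{c}
R(r) = \mathtt{ptr}(b, 0) \qquad \mathtt{funct}(G, b) = \lfloor Fd \rfloor \qquad Fd.\mathtt{sig} = sig \\
\mathtt{loadstack}(\mathtt{int}, M, \sigma, F.\mathtt{link}) = \lfloor \Sigma.\mathtt{top}.\sigma \rfloor \qquad \mathtt{loadstack}(\mathtt{int}, M, \sigma, F.\mathtt{retaddr}) = \lfloor \Sigma.\mathtt{top}.ra \rfloor
\end{array}}
     {G \vdash \mathcal{S}(\Sigma, F, \sigma, \mathtt{tailcall}(sig, r).c, R, M) \xrightarrow{\epsilon} \mathcal{C}(\Sigma, Fd, R, M)}
\]
\[
\frac{\mathtt{loadstack}(\mathtt{int}, M, \sigma, F.\mathtt{link}) = \lfloor \Sigma.\mathtt{top}.\sigma \rfloor \qquad \mathtt{loadstack}(\mathtt{int}, M, \sigma, F.\mathtt{retaddr}) = \lfloor \Sigma.\mathtt{top}.ra \rfloor}
     {G \vdash \mathcal{S}(\Sigma, F, \sigma, \mathtt{return}.c, R, M) \xrightarrow{\epsilon} \cdots}
\]
\[
\frac{\cdots}
     {G \vdash \mathcal{C}(\Sigma, \mathtt{internal}(F), R, M) \xrightarrow{\epsilon} \mathcal{S}(\Sigma, F, \sigma, F.\mathtt{code}, R, M')}
\]

Fig. 15 Semantics of Mach (selected rules).



12.2 Code transformation

The translation from Linear to Mach proceeds in two steps. First, the Linear code is
scanned to determine which stack slots and callee-save registers it uses. Based on this
information, the activation record is laid out following the general shape pictured in
figure 16. The total size of the record and the byte offsets for each of its areas are
determined. From these offsets, we can define a function $\Delta$ mapping callee-save registers,
local and outgoing stack slots to byte offsets, as suggested in figure 16. (Note that
these offsets are negative, while positive offsets within the activation frame correspond
to Cminor stack data. This choice is compatible with our pointer and memory model,
where offsets in pointers are signed integers, and it simplifies the soundness proof.)

[Figure 16: diagram of the current activation frame along a byte-offset axis relative to
the stack pointer $\sigma$. Areas shown: Cminor stack data; saved float registers; float locals;
padding; saved integer registers; integer locals; outgoing arguments; return address and
back link. Below it, the caller's activation frame with its outgoing arguments area.
Local and outgoing slots map into the current frame; incoming slots map into the
outgoing arguments of the caller's frame.]

Fig. 16 Layout of Mach activation frames; mapping from Linear stack slots to frame locations.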

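The generation of Mach code is straightforward. getslot and setslot Linear
instructions are rewritten as follows:

\[
\begin{array}{rcl}
[\![\mathtt{getslot}(\mathtt{local}(\tau, \delta), r)]\!] & = & \mathtt{getstack}(\tau, \Delta(\mathtt{local}(\tau, \delta)), r) \\
[\![\mathtt{getslot}(\mathtt{outgoing}(\tau, \delta), r)]\!] & = & \mathtt{getstack}(\tau, \Delta(\mathtt{outgoing}(\tau, \delta)), r) \\
[\![\mathtt{getslot}(\mathtt{incoming}(\tau, \delta), r)]\!] & = & \mathtt{getparent}(\tau, \Delta(\mathtt{outgoing}(\tau, \delta)), r) \\
[\![\mathtt{setslot}(r, \mathtt{local}(\tau, \delta))]\!] & = & \mathtt{setstack}(r, \tau, \Delta(\mathtt{local}(\tau, \delta))) \\
[\![\mathtt{setslot}(r, \mathtt{outgoing}(\tau, \delta))]\!] & = & \mathtt{setstack}(r, \tau, \Delta(\mathtt{outgoing}(\tau, \delta)))
\end{array}
\]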
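
The rewriting rules above can be sketched in Python as a function parameterized by the offset mapping. Everything concrete here is an assumption of the sketch: the tuple encoding of instructions, the name `transl_slot_access`, and in particular `example_delta`, whose base offsets are invented for illustration and do not reflect the actual frame layout.

```python
def example_delta(slot):
    """Hypothetical offset mapping: an invented (negative) base offset per
    area plus 4 bytes per word of slot offset."""
    space, _ty, ofs = slot
    base = {"local": -64, "outgoing": -128}[space]
    return base + 4 * ofs

def transl_slot_access(instr, delta):
    """Rewrite one Linear slot access into the corresponding Mach
    instruction; other instructions are returned unchanged."""
    kind = instr[0]
    if kind == "getslot":
        _, slot, r = instr
        space, ty, ofs = slot
        if space == "incoming":
            # An incoming slot is an outgoing slot of the caller's frame.
            return ("getparent", ty, delta(("outgoing", ty, ofs)), r)
        return ("getstack", ty, delta(slot), r)
    if kind == "setslot":
        _, r, slot = instr
        space, ty, _ofs = slot
        if space == "incoming":
            raise ValueError("incoming slots are never written")
        return ("setstack", r, ty, delta(slot))
    return instr
```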



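Moreover, instructions to save and restore the callee-save registers $r_1, \ldots, r_n$ used in
the function are inserted as function prologue and epilogue:

\[
\begin{array}{rcl}
[\![\mathtt{return}]\!] & = & \ldots\ \mathtt{getstack}(\tau(r_i), \Delta(r_i), r_i)\ \ldots\ \mathtt{return} \\
[\![F]\!] & = & \{\mathtt{code} = \ldots\ \mathtt{setstack}(r_i, \tau(r_i), \Delta(r_i))\ \ldots\ [\![F.\mathtt{code}]\!];\ \ldots\}
\end{array}
\]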



Finally, the translation $[\![F]\!]$ of a function must fail if the "frame" part of the
activation record (everything except the Cminor stack data) is bigger than $2^{31}$ bytes. Indeed,
in this case the signed integer offsets used to access locations within the activation
record would overflow, making it impossible to access some frame components.
12.3 Semantic preservation

While the code transformation outlined above is simple, its proof of correctness is
surprisingly difficult, mostly because it entails reasoning about memory separation
properties (between areas of the activation record and between different activation
records). To manage this complexity, we broke the proof into two sub-proofs, using an
alternate semantics for Mach to connect them.

In this alternate semantics, the "frame" part of activation records does not reside
in memory; instead, its contents are recorded separately in a component $\Phi$ of regular
states. This environment $\Phi$ maps (type, byte offset) pairs to values, taking overlap into
account in the style of the maps $L$ from locations to values introduced in section 8.1.
Each function activation has its own frame environment $\Phi$, saved and restored from
the call stack $\Sigma$. In the alternate semantics for Mach, the getstack and setstack
instructions are reinterpreted as accesses and updates to $\Phi$, and getparent instructions
as accesses to $\Sigma.\mathtt{top}.\Phi$.

The first part of the proof shows a simulation diagram of the "plus" kind between
executions of the original Linear code and executions of the generated Mach code using
the alternate semantics outlined above. Thanks to the alternate semantics, memory
states are identical between pairs of matching Linear and Mach states. The main
invariant to be enforced is agreement between, on the Linear side, the location states $L$
and $\Sigma.\mathtt{top}.L$ of the current and calling functions and, on the Mach side, the register
state $R$ and the frame states $\Phi$ and $\Sigma'.\mathtt{top}.\Phi$. This agreement captures the following
conditions:

- $L(r) = R(r)$ for all registers $r$;

- $L(s) = \Phi(\tau, \Delta(s))$ for all local and outgoing stack slots $s$ of type $\tau$ used in the
  current function;

- $L(s) = \Sigma'.\mathtt{top}.\Phi(\tau, \Delta(s))$ for all incoming slots $s$ of type $\tau$ used in the current
  function;

- $\Sigma.\mathtt{top}.L(r) = \Phi(\tau, \Delta(r))$ for all callee-save registers $r$ of type $\tau$ used in the current
  function;

- $\Sigma.\mathtt{top}.L(r) = R(r)$ for all callee-save registers $r$ not used in the current function.

The preservation of agreement during execution steps follows mainly from separation 
properties between the various areas of activation records. 

The second part of the proof shows a lock-step simulation result between executions
of Mach programs that use the alternate and standard semantics, respectively. Here,
the memory states $M$ (of the alternate semantics) and $M'$ (of the standard semantics)
differ: for each activation record $b$, the block $M'(b)$ is larger than the block $M(b)$
(because it contains additional "frame" data), but the two blocks have the same contents
on byte offsets valid for $M(b)$ (these offsets correspond to the Cminor stack data). We
capture this connection between memory states by the "memory extension" ordering
$M \le M'$:

\[
M \le M' \ \stackrel{\mathrm{def}}{=}\ \forall \kappa, b, \delta, v,\ \mathtt{load}(\kappa, M, b, \delta) = \lfloor v \rfloor \implies \mathtt{load}(\kappa, M', b, \delta) = \lfloor v \rfloor
\]

Leroy and Blazy [59, section 5.2] study the properties of this relation between memory
states and show that it commutes nicely with the store, alloc and free operations
over memory states.
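Another invariant that we must maintain is that the contents of the block $M'(b)$
agree with the frame state $\Phi$ of the alternate Mach semantics:

\[
\begin{array}{rcl}
F, M, M' \models \Phi \sim b, \delta & \stackrel{\mathrm{def}}{=} & b \text{ is valid in } M \ \wedge\ \mathcal{L}(M, b) = 0 \\
 & \wedge & \mathcal{L}(M', b) = F.\mathtt{stack\_low} \ \wedge\ \mathcal{H}(M', b) \ge 0 \\
 & \wedge & \delta = F.\mathtt{stack\_low} \bmod 2^{32} \\
 & \wedge & \forall \tau, n,\ F.\mathtt{stack\_low} \le n \wedge n + |\tau| \le 0 \implies \\
 & & \qquad \mathtt{load}(\kappa(\tau), M', b, n) = \lfloor \Phi(\tau, n) \rfloor
\end{array}
\]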

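The invariant between alternate Mach states and standard Mach states is, then, of
the following form:

\[
\frac{\Sigma \sim \Sigma' \qquad F, M, M' \models \Phi \sim \sigma, \delta \qquad M \le M' \qquad \Sigma' \prec \sigma}
     {\mathcal{S}(\Sigma, F, \mathtt{ptr}(\sigma, \delta), c, R, \Phi, M) \sim \mathcal{S}(\Sigma', F, \mathtt{ptr}(\sigma, \delta), c, R, M')}
\]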

The notation $\Sigma' \prec \sigma$ means that the stack blocks $\sigma_1, \ldots, \sigma_n$ appearing in the call stack
$\Sigma'$ are such that $\sigma > \sigma_1 > \ldots > \sigma_n$ and are therefore pairwise distinct. To complete the
proof of simulation between the alternate and standard semantics for Mach, we need
to exploit well-typedness properties. The frame environments $\Phi$ used in the alternate
semantics of Mach satisfy the classic "good variable" property $\Phi\{(\tau, \delta) \leftarrow v\}(\tau, \delta) = v$
regardless of whether the value $v$ matches the claimed type $\tau$. However, once frames are
mapped to memory locations, writing e.g. a float value with memory quantity int32
and reading it back with the same memory quantity results in undef, not the stored
value. More precisely:

Lemma 8 Assume $\mathtt{storestack}(\tau, M, \sigma, \delta, v) = \lfloor M' \rfloor$. Then,
$\mathtt{loadstack}(\tau, M', \sigma, \delta) = \lfloor v \rfloor$ if and only if $v : \tau$ (as defined in section 3.2).
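
A toy model may help to see why the typing hypothesis matters. The Python sketch below is not the memory model of the development: it ignores overlap between slots, represents values as tagged tuples, and assumes that undef belongs to every type. The names mirror storestack and loadstack, but the frame is a plain dictionary indexed by offset.

```python
UNDEF = ("undef",)

def chunk_of(ty):
    """Memory quantity kappa(ty) used for a slot of type ty."""
    return {"int": "int32", "float": "float64"}[ty]

def has_type(v, ty):
    """Assumed typing judgment v : ty; undef belongs to every type."""
    return v == UNDEF or v[0] == ty

def storestack(ty, frame, ofs, v):
    """Store v at offset ofs with the memory quantity of type ty."""
    new = dict(frame)
    new[ofs] = (chunk_of(ty), v)
    return new

def loadstack(ty, frame, ofs):
    """Load with the memory quantity of type ty. Returns None when nothing
    was stored, and undef when the stored value does not fit the quantity."""
    if ofs not in frame:
        return None
    chunk, v = frame[ofs]
    if chunk != chunk_of(ty) or not has_type(v, ty):
        return UNDEF
    return v
```

In this model, storing the integer 42 in an int slot and loading it back yields 42, whereas storing a float in the same int slot and loading it back yields undef: the "good variable" property holds only for well-typed values.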

Therefore, we need to exploit the well-typedness of the Mach code in a trivial int-or-float
type system in the style of section 4.3 to ensure that the values $v$ stored in a
stack location of type $\tau$ always semantically belong to type $\tau$. We say that a Mach
alternate execution state $S$ is well typed if

- all functions and code sequences appearing in $S$ are well typed;

- all abstract frames $\Phi$ appearing in $S$ are such that $\forall \tau, \delta,\ \Phi(\tau, \delta) : \tau$;

- all register states $R$ appearing in $S$ are such that $\forall r,\ R(r) : \tau(r)$.

We can then prove a "subject reduction" lemma showing that well-typedness is
preserved by transitions.

Lemma 9 If $G \vdash S \xrightarrow{t} S'$ in the Mach abstract semantics and $S$ is well typed, then $S'$
is well typed.

Combining this well-typedness property with the invariant between alternate and
standard Mach states, we prove the following lock-step simulation result between the two
Mach semantics.

Lemma 10 If $G \vdash S_1 \xrightarrow{t} S_1'$ in the Mach abstract semantics and $S_1$ is well typed
and $S_1 \sim S_2$, there exists $S_2'$ such that $G \vdash S_2 \xrightarrow{t} S_2'$ in the standard semantics and
$S_1' \sim S_2'$.

Semantic preservation for the compiler pass that constructs activation records then 
follows from the two sub-proofs of simulation outlined above. 
Machine instructions:

    add      blr      fcmpu    lfsx     nand     stfdx
    addi     bt       fdiv     lha      nor      stfs
    addis    cmplw    fmadd    lhax     or       stfsx
    addze    cmplwi   fmr      lhz      orc      sth
    and.     cmpw     fmsub    lhzx     ori      sthx
    andc     cmpwi    fmul     lwz      oris     stw
    andi.    cror     fneg     lwzx     rlwinm   stwx
    andis.   divw     frsp     mfcr     slw      subfc
    b        divwu    fsub     mflr     sraw     subfic
    bctr     eqv      lbz      mr       srawi    xor
    bctrl    extsb    lbzx     mtctr    srw      xori
    bf       extsh    lfd      mtlr     stb      xoris
    bl       fabs     lfdx     mulli    stbx
    bs       fadd     lfs      mullw    stfd

Macro-instructions:

    allocframe, freeframe: allocation and deallocation of a stack frame
    lfi: load a floating-point constant in a float register
    fcti, fctiu: conversion from floats to integers
    ictf, iuctf: conversion from integers to floats

Fig. 17 The subset of PowerPC instructions used in Compcert.



13 The output language: PowerPC assembly language 

The target language for our compiler is PPC, an abstract syntax for a subset of the
PowerPC assembly language [47], comprising 82 of the 200+ instructions of this processor,
plus 7 macro-instructions. The supported instructions are listed in figure 17.



13.1 Syntax 



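The syntax of PPC has the following shape:

\[
\begin{array}{lrcl}
\text{Integer registers:} & r_i & ::= & \mathtt{R0} \mid \mathtt{R1} \mid \ldots \mid \mathtt{R31} \\
\text{Float registers:} & r_f & ::= & \mathtt{F0} \mid \mathtt{F1} \mid \ldots \mid \mathtt{F31} \\
\text{Condition bits:} & r_c & ::= & \mathtt{CR0} \mid \mathtt{CR1} \mid \mathtt{CR2} \mid \mathtt{CR3} \\
\text{Constants:} & cst & ::= & n \mid \mathtt{lo16}(id + \delta) \mid \mathtt{hi16}(id + \delta) \\
\text{Instructions:} & i & ::= & \mathtt{label}(lbl) \mid \mathtt{bt}(r_c, lbl) \\
 & & \mid & \mathtt{addi}(r_i, r_i', cst) \\
 & & \mid & \mathtt{add}(r_i, r_i', r_i'') \\
 & & \mid & \mathtt{fadd}(r_f, r_f', r_f'') \\
 & & \mid & \ldots \\
\text{Internal functions:} & F & ::= & i^*
\end{array}
\]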



PPC is an assembly language, not a machine language. This is apparent in the
use of symbolic labels in branch instructions such as bt, and in the use of symbolic
constants lo16($id + \delta$) and hi16($id + \delta$) as immediate operands of some instructions.
(These constants, resolved by the linker, denote the low-order and high-order 16 bits
of the memory address of symbol $id$ plus offset $\delta$.)
Moreover, PPC features a handful of macro-instructions that expand to canned
sequences of actual instructions during pretty-printing of the abstract syntax to
concrete assembly syntax. These macro-instructions include allocation and deallocation
of the stack frame (mapped to arithmetic on the stack pointer register), conversions
between integers and floats (mapped to memory transfers and complicated bit-level
manipulations of IEEE floats), and loading of a floating-point literal (mapped to a
load from a memory-allocated constant). The reason for treating these operations as
basic instructions is that the memory model and the axiomatization of IEEE float
arithmetic that we use are too abstract to verify the correctness of the corresponding
canned instruction sequences. (For example, our memory model cannot express that
two abstract stack frames are adjacent in memory.) We leave this verification to future
work, but note that these canned sequences of instructions are identical to those used
by GCC and other PowerPC compilers and therefore have been tested extensively.



13.2 Semantics 

Program states in PPC are pairs $S ::= (R, M)$ of a memory state $M$ and a register state
$R$ associating values to the processor registers that we model, namely integer registers
$r_i$, floating-point registers $r_f$, bits 0 to 3 of the condition register CR, the program
counter PC, and the special "link" and "counter" registers LR and CTR.

The core of PPC's operational semantics is a transition function $T(i, S) = \lfloor S' \rfloor$ that
determines the state S' after executing instruction i in initial state S. In particular, the 
program counter PC is incremented (for fall-through instructions) or set to the branch 
target (for branching instructions). We omit the definition of T in this article, as it is a 
very large case analysis, but refer the reader to the Coq development for more details. 
The semantics of PPC, then, is defined by just two transition rules: 

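\[
\frac{R(\mathtt{PC}) = \mathtt{ptr}(b, n) \qquad G(b) = \lfloor \mathtt{internal}(c) \rfloor \qquad c\#n = \lfloor i \rfloor \qquad T(i, (R, M)) = \lfloor (R', M') \rfloor}
     {G \vdash (R, M) \xrightarrow{\epsilon} (R', M')}
\]
\[
\frac{\begin{array}{c}
R(\mathtt{PC}) = \mathtt{ptr}(b, 0) \qquad G(b) = \lfloor \mathtt{external}(Fe) \rfloor \\
\mathtt{extcall\_arguments}(R, M, Fe.\mathtt{sig}) = \lfloor \vec{v} \rfloor \qquad \vdash Fe(\vec{v}) \stackrel{t}{\Rightarrow} v \\
R' = R\{\mathtt{PC} \leftarrow R(\mathtt{LR}),\ \mathtt{extcall\_result}(Fe.\mathtt{sig}) \leftarrow v\}
\end{array}}
     {G \vdash (R, M) \xrightarrow{t} (R', M)}
\]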

The first rule describes the execution of one PPC instruction within an internal function.
The notation $c\#n$ denotes the $n$-th instruction in the list $c$. The first three premises
model abstractly the act of reading and decoding the instruction pointed to by the
program counter PC. For simplicity, we consider that all instructions occupy one byte
in memory, so that incrementing the program counter corresponds to branching to the
next instruction. It would be easy to account for variable-length instructions (e.g. 4
bytes for regular instructions, 0 bytes for labels, and 4n bytes for macro-instructions). The
second rule describes the big-step execution of an invocation of an external function.
The extcall_arguments function extracts the arguments to the call from the registers
and stack memory locations prescribed by the signature of the external call; likewise,
extcall_result denotes the register where the result must be stored. Conventionally,
the address of the instruction following the call is found in register LR; setting PC to
$R(\mathtt{LR})$ therefore returns to the caller.

13.3 Determinism 

Determinism of the target language plays an important role in the general framework 
described in section 2. It is therefore time to see whether the semantics of PPC is 
deterministic. In the general case, the answer is definitely "no": in the rule for external 
function calls, the result value v of the call is unconstrained and can take any value, 
resulting in different executions for the same PPC program. However, this is a form of 
external nondeterminism: the choice of the result value v is not left to the program, 
but is performed by the "world" (operating system, user interaction, ...) in which the
program executes. As we now formalize, if the world is deterministic, so is the semantics
of PPC. A deterministic world is modeled as a partial function W taking the identifier
of an external call and its argument values and returning the result value of the call as
well as an updated world W'. A finite trace t or infinite trace T is legal in a world W,
written $W \models t$ or $W \models T$, if the result values of external calls recorded in the trace
agree with what the world W imposes:

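$$
\frac{W(id, \vec{v}) = \lfloor v, W' \rfloor \qquad W' \models t}
     {W \models id(\vec{v} \mapsto v).t}
\qquad\qquad
\frac{W(id, \vec{v}) = \lfloor v, W' \rfloor \qquad W' \models T}
     {W \models id(\vec{v} \mapsto v).T}
$$
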
We extend this definition to program behaviors in the obvious way: 

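$$
\frac{W \models t}{W \models \mathrm{converges}(t, n)}
\qquad
\frac{W \models T}{W \models \mathrm{diverges}(T)}
\qquad
\frac{W \models t}{W \models \mathrm{goeswrong}(t)}
$$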

We could expect that a PPC program has at most one behavior B that is legal in a
deterministic initial world W. This is true for terminating behaviors, but for diverging
behaviors a second source of apparent nondeterminism appears, caused by the coinductive
definition of the infinite closure relation $G \vdash S \xrightarrow{T} \infty$ in section 3.5. Consider a
program P that diverges silently, such as an infinite empty loop. According to the
definitions in section 3.5, this program has behaviors diverges(T) for any finite or infinite
trace T, not just $T = \epsilon$ as expected. Indeed, no finite observation of the execution of
P can disprove the claim that it executes with a trace $T \neq \epsilon$. However, using classical
logic (the axiom of excluded middle), it is easy to show that the set of possible traces
admits a minimal element for the prefix ordering between traces T. This minimal trace
is infinite if the program is reactive (performs infinitely many I/O operations separated
by finite numbers of internal computation steps) and finite otherwise (if the program
eventually loops without performing any I/O). By restricting observations to legal
behaviors with minimal traces, we finally obtain the desired determinism property for
PPC.

Theorem 7 Let P be a PPC program, W be a deterministic initial world, and $P \Downarrow B$
and $P \Downarrow B'$ be two executions of P. If $W \models B$ and $W \models B'$ and moreover the traces
of B and B' are minimal, then B' = B up to bisimilarity of infinite traces.
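
To make the notion of a legal trace concrete, here is a minimal Python sketch of the finite-trace case. It assumes an encoding that is not in the paper: a world is a function from an identifier and an argument tuple to either `None` (the partial function is undefined) or a pair of the result and the updated world, and a trace is a list of `(id, args, result)` events. The `counter_world` example is purely hypothetical.

```python
def legal(world, trace):
    """Check W |= t for a finite trace t, threading the updated world along."""
    for ident, args, result in trace:
        answer = world(ident, args)
        if answer is None:            # W is partial: this call is not defined
            return False
        expected, world = answer      # result imposed by W, and updated world W'
        if result != expected:        # the trace disagrees with the world
            return False
    return True                       # the empty trace is legal in any world


def counter_world(count=0):
    """Hypothetical deterministic world: 'tick' returns the number of past ticks."""
    def world(ident, args):
        if ident == "tick" and args == ():
            return count, counter_world(count + 1)
        return None
    return world
```

Because the world is a function, each event has at most one admissible result, which is the source of determinism exploited by Theorem 7.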

14 Generation of PowerPC assembly language

14.1 Code generation 

The final compilation pass of Compcert translates from Mach to PPC by expanding 
Mach instructions into canned sequences of PPC instructions. For example, a Mach 
conditional branch cond(cond, $\vec{r}$, $l_{true}$) becomes a cmplw, cmplwi, cmpw, cmpwi or fcmp
instruction that sets condition bits, followed in some cases by a cror instruction to 
merge two condition bits, followed by a bt or bf conditional branch. Moreover, Mach 
registers are injected into PPC integer or float registers. 

The translation deals with various idiosyncrasies of the PowerPC instruction set, 
such as the limited range for immediate arguments to integer arithmetic operations, 
and the fact that register R0 reads as 0 when used as argument to some instructions.
Two registers (R2 and F13) are reserved for these purposes. The translation is presented 
as a number of "smart constructor" functions that construct and combine sequences of 
PPC operations. To give the flavor of the translation, here are the smart constructors
for "load integer immediate" and "add integer immediate". The functions low(n) and
high(n) compute the signed 16-bit integers such that $n = \mathrm{high}(n) \times 2^{16} + \mathrm{low}(n)$.

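$$
\mathrm{loadimm}(r, n) =
\begin{cases}
\mathtt{addi}(r, \mathtt{R0}, n) & \text{if } \mathrm{high}(n) = 0; \\
\mathtt{addis}(r, \mathtt{R0}, \mathrm{high}(n)) & \text{if } \mathrm{low}(n) = 0; \\
\mathtt{addis}(r, \mathtt{R0}, \mathrm{high}(n));\ \mathtt{ori}(r, r, \mathrm{low}(n)) & \text{otherwise}
\end{cases}
$$

$$
\mathrm{addimm}(r_d, r_s, n) =
\begin{cases}
\mathrm{loadimm}(\mathtt{R2}, n);\ \mathtt{add}(r_d, r_s, \mathtt{R2}) & \text{if } r_d = \mathtt{R0} \text{ or } r_s = \mathtt{R0}; \\
\mathtt{addi}(r_d, r_s, n) & \text{if } \mathrm{high}(n) = 0; \\
\mathtt{addis}(r_d, r_s, \mathrm{high}(n)) & \text{if } \mathrm{low}(n) = 0; \\
\mathtt{addis}(r_d, r_s, \mathrm{high}(n));\ \mathtt{addi}(r_d, r_d, \mathrm{low}(n)) & \text{otherwise}
\end{cases}
$$

Just as the generation of Mach code must fail if the activation record is too large
to be addressed by machine integers (section 12.2), the PPC generator must fail if
the translation of a function contains $2^{31}$ or more instructions, since it would then
be impossible to address some of the instructions via a signed 32-bit offset from the
beginning of the function.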
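
The two smart constructors can be sketched in Python as follows. This is an illustration under stated assumptions, not the Coq code of the development: instructions are represented as plain tuples, registers as strings, and the helpers `sign_extend16` and `to_signed32` are hypothetical names introduced here. The decomposition satisfies n = high(n) × 2^16 + low(n) modulo 2^32.

```python
def sign_extend16(x):
    """Interpret the low 16 bits of x as a signed 16-bit integer."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def to_signed32(n):
    """Normalize n to the signed 32-bit range (helper assumed for this sketch)."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n & 0x80000000 else n


def low(n):
    return sign_extend16(n)


def high(n):
    # Subtracting low(n) first compensates for its sign.
    return sign_extend16((n - low(n)) >> 16)


def loadimm(r, n):
    """Instruction sequence for 'load integer immediate'."""
    n = to_signed32(n)
    if high(n) == 0:
        return [("addi", r, "R0", n)]
    if low(n) == 0:
        return [("addis", r, "R0", high(n))]
    return [("addis", r, "R0", high(n)), ("ori", r, r, low(n))]


def addimm(rd, rs, n):
    """Instruction sequence for 'add integer immediate'."""
    n = to_signed32(n)
    if rd == "R0" or rs == "R0":
        # R0 reads as 0 in addi/addis, so go through the reserved register R2.
        return loadimm("R2", n) + [("add", rd, rs, "R2")]
    if high(n) == 0:
        return [("addi", rd, rs, n)]
    if low(n) == 0:
        return [("addis", rd, rs, high(n))]
    return [("addis", rd, rs, high(n)), ("addi", rd, rd, low(n))]
```

The sketch only builds instruction sequences; it does not model their execution.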



14.2 Semantic preservation 

Semantic preservation for PPC generation is proved using a simulation diagram of the 
"option" type. The two main invariants to be preserved are: 

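1. The PC register contains a pointer ptr(b, $\delta$) that corresponds to the Mach function
F and code sequence c currently executing:

$$
F, c \sim \mathrm{ptr}(b, \delta) \;\stackrel{\mathrm{def}}{\Longleftrightarrow}\; G(b) = \lfloor \mathrm{internal}(F) \rfloor \;\wedge\; \llbracket c \rrbracket = \mathrm{suffix}(\llbracket F \rrbracket, \delta)
$$

2. The Mach register state R and stack pointer $\sigma$ agree with the PPC register state
R':

$$
R, \sigma \sim R' \;\stackrel{\mathrm{def}}{\Longleftrightarrow}\; R'(\mathtt{R1}) = \sigma \;\wedge\; \forall r,\ R(r) = R'(\overline{r})
$$

where $\overline{r}$ denotes the PPC register associated with the Mach register r.
(Conventionally, the R1 register is used as the stack pointer.)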

More precisely, matching between a Mach execution state and a PPC execution state
is defined as follows: 

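$$
\frac{\Sigma\ \mathrm{wf} \qquad F, c \sim R'(\mathtt{PC}) \qquad R, \sigma \sim R'}
     {\mathcal{S}(\Sigma, F, \sigma, c, R, M) \sim (R', M)}
$$

$$
\frac{\Sigma\ \mathrm{wf} \qquad R'(\mathtt{PC}) = \mathrm{ptr}(b, 0) \qquad G(b) = \lfloor Fd \rfloor \qquad R'(\mathtt{LR}) = \Sigma.\mathrm{top}.ra \qquad R, \Sigma.\mathrm{top}.\sigma \sim R'}
     {\mathcal{C}(\Sigma, Fd, R, M) \sim (R', M)}
$$

$$
\frac{\Sigma\ \mathrm{wf} \qquad R'(\mathtt{PC}) = \Sigma.\mathrm{top}.ra \qquad R, \Sigma.\mathrm{top}.\sigma \sim R'}
     {\mathcal{R}(\Sigma, R, M) \sim (R', M)}
$$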

The invariant $\Sigma$ wf over Mach call stacks is defined as $F, c \sim ra$ for all
$\mathcal{F}(F, \sigma, ra, c) \in \Sigma$. The case for call states reflects the convention that the
caller saves its return address in register LR before jumping to the first instruction of the
callee. As mentioned in section 12.1, the Mach semantics is parameterized by an oracle
retaddr(F, c, ra) that guesses the code pointer ra pointing to the PPC code corresponding
to the translation of Mach code c within function F. We construct a suitable oracle by
noticing that the translation of a Mach code $i_1 \ldots i_n$ is simply the concatenation
$\llbracket i_1 \rrbracket \ldots \llbracket i_n \rrbracket$ of the translations of the instructions. Therefore, the offset of
the return address ra is simply the position of the suffix $\llbracket c \rrbracket$ of $\llbracket F \rrbracket$. It is always
uniquely defined if c is a suffix of F.code, which the Mach semantics guarantees. The
simulation diagram is of the "star" kind because Mach transitions from return states
$\mathcal{R}$ to regular states $\mathcal{S}$ correspond to zero transitions in the generated PPC code. The
absence of infinite stuttering is trivially shown by associating measure 1 to return states
and measure 0 to regular and call states. The proof of the simulation diagram is long
because of the high number of cases to consider but presents no particular difficulties. To
reason more easily about the straight-line execution of a sequence of non-branching PPC
instructions, we make heavy use of the following derived execution rule:

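$$
\frac{\begin{array}{c}
R_0(\mathtt{PC}) = \mathrm{ptr}(b, \delta) \qquad G(b) = \lfloor \mathrm{internal}(c) \rfloor \qquad \mathrm{suffix}(c, \delta) = i_1 \ldots i_n.c' \qquad n > 0 \\
\text{for all } k \in \{1, \ldots, n\},\ T(i_k, R_{k-1}, M_{k-1}) = \lfloor (R_k, M_k) \rfloor \text{ and } R_k(\mathtt{PC}) = R_{k-1}(\mathtt{PC}) + 1
\end{array}}
{G \vdash (R_0, M_0) \xrightarrow{\epsilon}{}^{+} (R_n, M_n)}
$$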
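
The construction of the return-address oracle from the concatenation property can be sketched as follows. The instruction names and the `translate_instr` table are hypothetical stand-ins for the real expansion, and the sketch adopts the simplification used above that every PPC instruction occupies one unit of code space.

```python
def translate_instr(instr):
    """Hypothetical expansion of one Mach instruction into PPC instructions."""
    expansions = {
        "cond": ["cmpw", "bt"],
        "add": ["add"],
        "call": ["bl"],
        "return": ["blr"],
    }
    return expansions[instr]


def translate_code(code):
    """The translation of i1 ... in is the concatenation of the translations."""
    return [ppc for instr in code for ppc in translate_instr(instr)]


def return_offset(function_code, continuation):
    """Offset of the translation of `continuation` within that of the function.

    `continuation` must be a suffix of `function_code`; otherwise the
    offset is not defined and an error is raised.
    """
    split = len(function_code) - len(continuation)
    if split < 0 or function_code[split:] != continuation:
        raise ValueError("continuation is not a suffix of the function code")
    return len(translate_code(function_code)) - len(translate_code(continuation))
```

Since translation distributes over concatenation, the translated continuation is exactly the suffix of the translated function starting at the computed offset.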

15 The Coq development 

We now discuss the Coq implementation of the algorithms and proofs described in this 
paper. 

15.1 Specifying in Coq 

The logic implemented by Coq is the Calculus of Inductive and Coinductive Con- 
structions (CICC), a very powerful constructive logic. It supports equally well three 
familiar styles of writing specifications: by functions and pattern-matching, by induc- 
tive or coinductive predicates representing inference rules, and by ordinary predicates 
in first-order logic. All three styles are used in the Compcert development, resulting 
in specifications and statements of theorems that remain quite close to what can be
found in programming language research papers. 

CICC also features higher-order logic, dependent types, a hierarchy of universes to
enforce predicativity, and an ML-style module system. These advanced logical features
are used sparingly in our development: higher-order functions and predicates are used 
for iterators over data structures and to define generic closure operations; simple forms 
of dependent types are used to attach logical invariants to some data structures and 
some monads (see section 6.4); coinduction is used to define and reason about infi- 
nite transition sequences; parameterized modules are used to implement the generic 
dataflow solvers of section 7.1. Most of our development is conducted in first-order 
logic, however. 

Two logical axioms (not provable in Coq) are used: function extensionality (if
$\forall x,\ f(x) = g(x)$, then the functions f and g are equal) and proof irrelevance (any
two proof terms of the same proposition P are equal). It might be possible to avoid 
using these two axioms, but they make some proofs significantly easier. The proofs 
of the semantic preservation theorems are entirely constructive. The only place where 
classical logic is needed, under the form of the axiom of excluded middle, is to show 
the existence of minimal traces for infinite executions (see section 13.3). This set of 
three axioms is consistent with the predicative version of Coq's logic that we use. 

15.2 Proving in Coq 

While Coq proofs can be checked a posteriori in batch mode, they are developed inter- 
actively using a number of tactics as elementary proof steps. The sequence of tactics 
used constitutes the proof script. Building such scripts is surprisingly addictive, in a 
videogame kind of way, but reading them is notoriously difficult. We were initially 
concerned that adapting and reusing proof scripts when specifications change could 
be difficult, forcing many proofs to be rewritten from scratch. In practice, careful
decomposition of the proofs in separate lemmas enabled us to reuse large parts of the
development even in the face of major changes in the semantics, such as switching from 
the "mixed-step" semantics described in [57] to the small-step transition semantics de- 
scribed in this paper. 

Our proofs frequently use the basic proof automation facilities provided by Coq, 
mostly eauto (Prolog-style resolution), omega (Presburger arithmetic) and congruence 
(a decision procedure for ground equalities with uninterpreted symbols). However, 
these tactics do not combine automatically, and significant manual massaging of the 
goals is necessary before they apply. The functional induction and functional inversion 
mechanisms of Coq 8.1 [6] helped reason about functions defined by complex pattern- 
matching. 

Coq also provides a dedicated Ltac language for users to define their own tactics. 
We used this facility occasionally, for instance to define a "monadic inversion" tactic 
that recursively simplifies hypotheses of the form (do x <- a; b) s = OK(s', r) into
$\exists s_1.\ \exists x.\ a\ s = \mathrm{OK}(s_1, x) \wedge b\ s_1 = \mathrm{OK}(s', r)$. There is no doubt that a Coq expert could
have found more opportunities for domain-specific tactics and could have improved 
significantly our proof scripts. 
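
The shape of the hypotheses handled by monadic inversion can be illustrated with a small Python model of a state-and-error monad. The encoding is an assumption of this sketch: a computation is a function from a state to either `None` (error) or a pair `(state, value)` standing for OK, and `fresh` is a hypothetical action. The function `monadic_inversion` computes the witnesses s1 and x whose existence the tactic derives.

```python
def ret(x):
    """Monadic return: leave the state unchanged."""
    return lambda s: (s, x)


def error(s):
    """A computation that always fails."""
    return None


def bind(a, b):
    """Model of `do x <- a; b`, where b is a function of x."""
    def run(s):
        first = a(s)
        if first is None:
            return None
        s1, x = first
        return b(x)(s1)
    return run


def fresh(s):
    """Hypothetical action: return the current counter and increment it."""
    return (s + 1, s)


def monadic_inversion(a, b, s):
    """If bind(a, b)(s) succeeds, return the intermediate witnesses (s1, x)."""
    if bind(a, b)(s) is None:
        return None
    s1, x = a(s)                      # a s = OK(s1, x) must hold
    assert b(x)(s1) == bind(a, b)(s)  # and b s1 gives the overall result
    return s1, x
```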

As mentioned earlier, most of the development is conducted in first-order logic,
suggesting the possibility of using automated theorem provers such as SMT solvers.
Preliminary experiments with using SMT solvers to prove properties of the Compcert
memory model [59] indicate that many but not all of the lemmas can be proved
automatically. While fully automated verification of a program like Compcert appears
infeasible with today's technology, we expect that our interactive proof scripts would
shrink significantly if Coq provided a modern SMT solver as one of its tactics.

Fig. 18 Size of the development (in non-blank, non-comment lines of code).



15.3 Size of the development 

The size of the Coq development can be estimated from the line counts given in
figure 18. The whole development represents approximately 37000 lines of Coq (excluding
comments and blank lines) plus 1000 lines of code directly written in Caml. The overall
effort represents approximately 2 person-years of work.

The Coq datatype and function definitions that implement the compiler itself (col- 
umn "Code in Coq" in figure 18) account for 14% of the source. In other terms, the Coq 
verification is about 6 times larger than the program being verified. The remaining 86% 
comprise 10% of specifications (mostly, the operational semantics for the source, target 
and intermediate languages), 21% of statements of lemmas, theorems and supporting 
definitions, 44% of proof scripts and 8% of directives, module declarations, and custom 
proof tactics. 

The sizes of individual passes are relatively consistent: the most difficult passes 
(RTL generation, register allocation, spilling and reloading, and layout of the activation
record) take between 2000 and 3000 lines each, while simpler passes (constant
propagation, CSE, tunneling, linearization) take less than 1500 lines each. One outlier
is the PPC code generation pass which, while conceptually simple, involves large definitions
and proofs by case analysis, totaling more than 3300 lines. Among the supporting
libraries, the formalizations of machine integer arithmetic and of the memory model 
are the largest and most difficult, requiring 1900 and 2300 lines respectively. 

Checking all the proofs in the development takes about 7.5 minutes of CPU time 
on a 2.4 GHz Intel Core 2 processor equipped with 4 Gb of RAM and running 64-bit
Linux. The version of Coq used is 8.1pl3. Parallel make with 2 cores results in a
wall-clock time of 4.5 minutes. 



16 Experimental results 

16.1 Extracting an executable compiler 

As mentioned in the Introduction, the verified parts of the Compcert compiler are 
programmed directly in Coq, then automatically translated to executable Caml code 
using Coq's extraction facility [61,62]. To obtain an executable compiler, this extracted 
Caml code is combined with: 

— A compiler front-end, translating the Clight subset of C to Cminor. This front-end 
is itself extracted from a Coq development. An earlier version of this front-end 
compiler is described in [15]. 

— A parser for C that generates Clight abstract syntax. This parser is built on top of 
the CIL library [77].

— Hand-written Caml implementations of the heuristics that we validate a posteriori 
(graph coloring, RTL type reconstruction, etc.), of the pretty-printer for PowerPC
assembly code, and of a cc-style compiler driver.

The resulting compiler runs on any platform supported by Caml and generates Pow- 
erPC code that runs under MacOS X. While the soundness proof for Compcert does 
not account for separate compilation and assumes that whole programs are compiled 
at once, the compiler can be used to separately compile C source files and link them
with precompiled libraries, which is convenient for testing. (The calling conventions 
implemented by Compcert are compatible enough with the standard PowerPC ABI to 
support this.) 

Program extraction performs two main tasks. First, it eliminates the parts of Coq 
terms that have no computational content, by a process similar to program slicing. For 
instance, if a data structure carries a logical invariant, every instance of this structure 
contains a proof term showing that the invariant is satisfied. This subterm does not 
contribute to the final result of the program, only to its correctness, and is therefore 
eliminated by extraction. The second task is to bridge the gap between Coq's rich type 
system and Caml's simpler Hindley-Milner type system. Uses of first-class polymor- 
phism or general dependent types in Coq can lead to programs that are not typeable 
in Caml; extraction works around this issue by inserting unsafe Caml coercions, locally 
turning off Caml's type checking. This never happens in Compcert, since the source 
Coq code is written in pedestrian ML style, using only Hindley-Milner types. 

Fig. 19 Execution times of compiled code. Times are given relative to those obtained with
gcc -O0. Lower percentages and shorter bars mean faster.

Generally speaking, the Caml code extracted from the Compcert development looks
like what a Caml programmer would write if confined to the purely functional subset
of the language. There are two exceptions. The first is related to the handling of
default cases in Coq pattern-matching. Consider a data type with 5 constructors A
to E, and a pattern-matching (A, A) -> x | (_, _) -> y. Internally, Coq represents
this definition by a complete matching having 5 × 5 cases, 24 of which are y. Extraction
does not yet re-factor the default case, resulting in 24 copies of the code y. On large
data types, this can lead to significant code explosion, which we limited on a case by
case basis, often by introducing auxiliary functions.
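
The count of duplicated default cases can be reproduced with a few lines of Python. This is only a model of the expansion, assuming a hypothetical `expand_match` helper that builds the complete case table for a match on pairs of constructors.

```python
from itertools import product

CONSTRUCTORS = ["A", "B", "C", "D", "E"]


def expand_match(clauses, default):
    """Expand explicit clauses plus a default into one case per constructor pair."""
    return {
        pair: clauses.get(pair, default)
        for pair in product(CONSTRUCTORS, repeat=2)
    }


# The matching (A, A) -> x | (_, _) -> y from the text.
table = expand_match({("A", "A"): "x"}, "y")
copies_of_default = sum(1 for body in table.values() if body == "y")
```

With n constructors the complete matching has n × n cases, so the number of copies of the default body grows quadratically with the size of the data type.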



The other problem with extraction we encountered is the $\eta$-expansions that
extraction sometimes performs in the hope of exposing more opportunities for program
slicing. These expansions can introduce inefficiencies by un-sharing computations.
Consider for example a curried function of two arguments x and y that takes x, performs
an expensive computation, then returns a function $\lambda y. \ldots$. After $\eta$-expansion, this
expensive computation is performed every time the second argument is passed. We ran
into this problem twice in Compcert. Manual patching of the extracted Caml code was
necessary to undo this "optimization".
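
The loss of sharing can be observed directly in a small Python model. The names `expensive`, `f` and `eta_expand` are hypothetical, and a call counter stands in for the cost of the computation.

```python
counter = {"expensive_calls": 0}


def expensive(x):
    """Stand-in for a costly computation; counts how often it runs."""
    counter["expensive_calls"] += 1
    return x * x


def f(x):
    t = expensive(x)              # performed once, when x is supplied
    return lambda y: t + y


def eta_expand(partial_application):
    """Rewrite a delayed `f x` into `fun y -> (f x) y`."""
    return lambda y: partial_application()(y)


# Without eta-expansion: the partial application is evaluated once and shared.
shared = f(3)
shared_results = [shared(y) for y in range(3)]
calls_shared = counter["expensive_calls"]

# With eta-expansion: f 3 is re-evaluated at every application.
counter["expensive_calls"] = 0
unshared = eta_expand(lambda: f(3))
unshared_results = [unshared(y) for y in range(3)]
calls_unshared = counter["expensive_calls"]
```

Both versions compute the same results; only the number of calls to the expensive computation differs.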

16.2 Benchmarks

Performance of the generated code can be estimated from the timings given in figure 19.
Since Compcert accepts only a subset of the C language (excluding variadic
functions and long long arithmetic types, for instance), standard benchmark suites 
cannot be used, and we reverted to a small home-grown test suite. The test programs 
range from 50 to 3000 lines of C code, and include computational kernels (FFT, N-body,
etc.), cryptographic primitives (AES, SHA1), text compression algorithms, a
virtual machine interpreter, and a ray tracer derived from the ICFP 2000 program- 
ming contest. The PowerPC code generated by Compcert was benchmarked against 
the code generated by GCC version 4.0.1 at optimization levels 0, 1 and 2. (Higher 
GCC optimization levels make no significant differences on this test suite.) For the 
purpose of this benchmark, Compcert was allowed to recognize the fused multiply-add 
and multiply-sub PowerPC instructions. (These instructions are normally not used in 
Compcert because they produce results different from a multiply followed by an add 
or sub, but since GCC uses them nonetheless, it is fair to allow Compcert to do so as 
well.) Measurements were performed on an Apple PowerMac workstation with two 2.0 
GHz PowerPC 970 (G5) processors and 6 Gb of RAM, running MacOS 10.4.11. 

As the timings in figure 19 show, Compcert generates code that is more than twice 
as fast as that generated by GCC without optimizations, and competitive with GCC 
at optimization levels 1 and 2. On average, Compcert code is only 7% slower than
gcc -O1 and 12% slower than gcc -O2. The test suite is too small to draw definitive
conclusions, but these results strongly suggest that while Compcert is not going to win 
any prize in high performance computing, the performance of generated programs is 
adequate for critical embedded code. 

Compilation times are higher for Compcert than for GCC but remain acceptable;
to compile the 3000-line ray tracer, Compcert takes 4.6 s while gcc -O1 takes 2.7 s.
There are several possible reasons for this slowdown. One is that Compcert proceeds
in a relatively high number of passes, each of which reconstructs entirely the
representation of the function being compiled. Another is the use of purely functional data
structures (balanced trees) instead of imperative data structures (bitvectors, hash
tables). We were careful, however, to use functional data structures with good asymptotic
complexity (mainly AVL trees and radix-2 trees), which cost only a logarithmic factor
compared with imperative data structures.
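
To illustrate the logarithmic cost, here is a minimal sketch of a persistent radix-2 tree keyed by positive integers. It is written for this exposition and is not the data structure of the development: a tree is either `None` (empty) or a tuple `(left, value, right)`, the bits of the key are consumed from the least significant end, and a stored value of `None` is indistinguishable from an absent key.

```python
def get(tree, key):
    """Look up `key` (a positive integer); return None if absent."""
    while tree is not None:
        left, value, right = tree
        if key == 1:
            return value
        tree = right if key & 1 else left
        key >>= 1
    return None


def put(tree, key, val):
    """Return a new tree mapping `key` to `val`; the old tree is unchanged."""
    left, value, right = tree if tree is not None else (None, None, None)
    if key == 1:
        return (left, val, right)
    if key & 1:
        return (left, value, put(right, key >> 1, val))
    return (put(left, key >> 1, val), value, right)
```

Each operation visits one node per bit of the key, and `put` rebuilds only the nodes on that path while sharing every other subtree with the previous version.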



17 Discussion and perspectives 

We now discuss some of the design choices and limitations of our work and outline 
directions for future work. 



17.1 On Cminor as a target language 

The Cminor intermediate language was designed to allow relatively direct translation
of a large subset of the C language. In particular, the memory model closely matches
that of C, and the block/exit mechanism makes it easy to translate C loops (including
break and continue) in a compositional manner. The C feature that appears most
difficult to support is variadic functions; this is not surprising given that variadic functions
account for most of the complexity of function calling conventions in C compilers.

Other features of C could be supported with small extensions to Cminor and the 
Compcert back-end. For instance, Cminor currently performs all floating-point
arithmetic in double precision, but it is planned to add single-precision float operators to
better support ISO C's floating-point semantics. Also, large switch statements could 
be compiled more efficiently if multi-way branches (jump tables) were added to RTL 
and later intermediate languages. 

Less obviously, Cminor can also be used as a target language when compiling higher- 
level source languages. Function pointers and tail-call optimization are supported, en- 
abling the compilation of object-oriented and functional languages. Exceptions in the 
style of ML, C++ or Java are not supported natively in Cminor but could be encoded 
(at some run-time cost) either as special return values or using continuation-passing 
style. For the former approach, it could be worthwhile to add functions with multiple 
return values to Cminor. 
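
The two encodings of exceptions can be sketched in Python, used here only as neutral notation for what a front-end might generate. The function names and the `("ok", v)` / `("exn", e)` tagging scheme are assumptions of this sketch.

```python
# Encoding 1: exceptions as special (tagged) return values.
def safe_div(a, b):
    if b == 0:
        return ("exn", "division by zero")
    return ("ok", a // b)


def sum_ratios(pairs):
    """Every call site tests the tag and propagates the exception upward."""
    total = 0
    for a, b in pairs:
        tag, payload = safe_div(a, b)
        if tag == "exn":
            return (tag, payload)
        total += payload
    return ("ok", total)


# Encoding 2: continuation-passing style, with one continuation for the
# normal result and one for the exceptional result.
def safe_div_cps(a, b, k_ok, k_exn):
    if b == 0:
        return k_exn("division by zero")
    return k_ok(a // b)
```

The first encoding returns a tag alongside the result, which is where functions with multiple return values would help; the second relies on function pointers and tail calls.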

For languages with automatic memory management, Cminor provides no native 
support for accurate garbage collection: mechanisms for tracking GC roots through 
register and stack allocation in the style of C-- [81] are not provided and appear
difficult to specify and prove correct. However, the Cminor producer can explicitly
register GC roots in Cminor stack blocks, in the style of Henderson [42]. Zaynah Dargaye
has prototyped a verified front-end compiler from the mini-ML functional language to 
Cminor that follows this approach [27]. 



17.2 On retargeting 

While some parts of the Compcert back-end are obviously specific to the PowerPC (e.g.
generation of assembly language, section 14), most parts are relatively independent of
the target processor and could conceivably be reused for a different target. To make
this claim more precise, we experimented with retargeting the Compcert back-end for
the popular ARM processor. The three aspects of this port that required significant 
changes in the back-end and in its proof are: 

— Reflecting the differences in instruction sets, the types and semantics of machine- 
specific operators, addressing modes and conditions (section 5.1) change. This im- 
pacts the instruction selection pass (section 5) but also the abstract interpretation 
of these operators performed by constant propagation (section 7.2). 

— Calling conventions and stack layout differ. Most differences are easy to abstract 
over, but the standard ARM calling convention could not be supported: it requires 
that floats are passed in pairs of integer registers, which our value and memory 
model cannot express yet. Nonstandard calling conventions, using float registers to 
pass floats, had to be used instead. 

— ARM has fewer registers than PowerPC (16 integer registers and 8 float registers 
instead of 32 and 32). Consequently, we had to reduce the number of registers 
reserved to act as temporary registers. This required some changes in the spilling 
and reloading pass (section 11). 

Overall, the port of Compcert and its proof to the ARM processor took about 3 weeks.
Three more weeks were needed to revise the modular structure of the Coq development,
separating the processor-specific parts from the rest, adapting the PowerPC and
ARM-specific parts so that they have exactly the same interface, and making provisions for
supporting other target processors in the future. Among the 37500 lines of the initial
development, 28000 (76%) were found to be processor-independent and 8900 (24%)
ended up in PowerPC-specific modules; an additional 8800 lines were added to support
the ARM processor.



17.3 On optimizations 

As mentioned in the Introduction, the main objective for the Compcert project was to 
prove end-to-end semantic preservation. This led us to concentrate on non-optimizing 
transformations that are required in a compiler and to spend less time on optimizations 
that are optional. Many interesting optimizations [73, 1] remain to be proved correct 
and integrated in Compcert. 

In separate work not yet part of Compcert, Jean-Baptiste Tristan verified two
additional optimizations: instruction scheduling by list scheduling and trace scheduling 
[32] and lazy code motion (LCM) [51]. These optimizations are more advanced than 
those described in section 7, since they move instructions across basic blocks and even 
across loop boundaries in the case of LCM. In both cases, Tristan used a translation 
validation approach (see section 2.2.2) where the code transformation is performed 
by untrusted Caml code, then verified a posteriori using a verifier that is formally 
proved correct in Coq. In the case of instruction scheduling, validation is performed 
by symbolic execution of extended basic blocks [93]. For LCM, validation exploits 
equations between program variables obtained by an available expressions analysis, 
combined with an anticipability analysis [94]. On these two examples, the verified 
translation validation approach was effective, resulting in relatively simple semantic 
correctness proofs that are insensitive to the many heuristic decisions taken by these
two optimizations. We conjecture that for several other optimizations, the verified 
translation validation approach is simpler than proving directly the correctness of the 
optimization. 
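
The verified translation validation approach can be sketched on a toy example: an untrusted scheduler reorders the instructions of a basic block, and a validator compares the symbolic states reached by the original and transformed blocks. The instruction format `(dst, op, src1, src2)`, the function names, and the choice to fall back to the original block when validation fails are all assumptions of this sketch; it does not reproduce the validators of [93] or [94].

```python
def symbolic_exec(block):
    """Map each assigned register to a symbolic expression over initial values."""
    state = {}

    def value_of(reg):
        return state.get(reg, ("init", reg))

    for dst, op, src1, src2 in block:
        state[dst] = (op, value_of(src1), value_of(src2))
    return state


def validate(original, transformed):
    """Accept only if both blocks leave every register with the same expression."""
    s1, s2 = symbolic_exec(original), symbolic_exec(transformed)
    regs = set(s1) | set(s2)
    return all(
        s1.get(r, ("init", r)) == s2.get(r, ("init", r)) for r in regs
    )


def validated_schedule(block, untrusted_scheduler):
    """Run the untrusted transformation and keep its result only if validated."""
    candidate = untrusted_scheduler(block)
    return candidate if validate(block, candidate) else block
```

Only `validate` would need to be proved correct; the scheduler can use arbitrary heuristics, since any incorrect output is rejected.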

Many advanced optimizations are formulated in terms of static single assignment
(SSA) representation rather than over classic intermediate representations like those 
currently used in Compcert. SSA enables more efficient static analyses and sometimes 
simpler code transformations. A typical SSA-based optimization that interests us is 
global value numbering [88]. Since the beginning of Compcert we have been considering
using SSA-based intermediate languages, but were held off by two difficulties. First, 
the dynamic semantics for SSA is not obvious to formalize (see [17,10,84] for various 
approaches). Second, the SSA property is global to the code of a whole function and 
not straightforward to exploit locally within proofs. Functional representations such 
as A-normal forms could offer some of the benefits of SSA with clearer semantics. In 
a translation validation setting, it might not be necessary to reason directly over the 
SSA form: the untrusted optimizations could convert to SSA, use efficient SSA-based 
algorithms and convert out of SSA; the validator, which is the only part that needs
proving, can still use conventional RTL-like representations. 

17.4 On memory

Whether to give formal semantics to imperative languages or to reason over pointer- 
based programs and transformations over these programs, the memory model is a 
crucial ingredient. The formalization of memory used in Compcert can be extended 
and refined in several directions. 

The first is to add memory allocation and deallocation primitives to Cminor, in the 
style of C's malloc and free. Both can be implemented in Cminor by carving sub-blocks 
out of a large global array, but primitive support for these operations could facilitate 
reasoning over Cminor programs. Supporting dynamic allocation is easy since it maps 
directly to the alloc function of our memory model. (This model does not assume that 
allocations and deallocations follow a stack-like discipline.) Dynamic, programmer- 
controlled deallocation requires more care: mapping it to the free operation of our 
memory model opens up the possibility that a Cminor function explicitly deallocates 
its own stack block. This could invalidate semantic preservation: if the Cminor function 
does not use its stack block and does not terminate, nothing wrong happens, but if 
some of its variables are spilled to memory, the corresponding Mach code could crash 
when accessing a spilled variable. Explicit deallocation of stack frames must therefore 
be prevented at the Cminor level, typically by tagging memory blocks as belonging to 
the stack or to the heap. 
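
One way to picture the tagging discipline is the following toy model, which is not the memory model of the development. The class `Memory`, its methods, and the decision to raise an error on a mismatched tag are hypothetical; in particular, unlike the model described in this paper, this sketch rejects freeing a block twice.

```python
class Memory:
    """Toy block allocator in which every block is tagged 'stack' or 'heap'."""

    def __init__(self):
        self.next_block = 0
        self.live = {}                # block identifier -> tag

    def alloc(self, kind):
        block = self.next_block
        self.next_block += 1
        self.live[block] = kind
        return block

    def free(self, block, kind):
        """Free `block` only if it is live and carries the expected tag."""
        if self.live.get(block) != kind:
            raise ValueError("block is not a live %s block" % kind)
        del self.live[block]


def program_free(mem, block):
    """Explicit deallocation requested by the program: heap blocks only."""
    mem.free(block, "heap")


def function_return(mem, stack_block):
    """Deallocation of a stack block, performed only on function return."""
    mem.free(stack_block, "stack")
```

With this separation, a program can never deallocate the stack block of a running function, which is the situation that would break semantic preservation.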

Another limitation of our memory model is that it completely hides the byte- 
level representation of integers and floats in memory. This makes it impossible to 
express in Cminor some C programming idioms of dubious legality but high practical 
usefulness such as copying byte-per-byte an arbitrary data structure to another of the
same layout (in the style of memmove). Doing so in Cminor would fill the destination 
structure with undef values. As mentioned in section 13.1, this feature of our memory 
model also prevents us from reasoning about machine code that manipulates the IEEE 
representation of floats at the bit level. As discussed in [59, section 7], a strong reason
for hiding byte-level in-memory representations is to ensure that pointer values cannot 
be forged from integers or floats; this guarantee plays a crucial role in proving semantic 
preservation for certain memory transformations. A topic for future work is to refine the 
memory model to obtain the best of both worlds: unforgeable pointers and byte-level 
access to the representations of integers and floats. 

Compared with "real" memory implementations or even with the version of our 
memory model presented in [59], the memory model currently used in Compcert makes
two simplifying assumptions: (1) free never fails, therefore allowing repeated deallo- 
cation of a given block; (2) alloc never fails, therefore modeling an infinite memory. 
Assumption (1) is not essential, and our proofs extend straightforwardly to showing 
that the compiler never inserts a double free operation. Assumption (2) on infinite 
memory is more difficult to remove, because in general a compiler does not preserve 
the stack memory consumption of the program it compiles. It is easy to show that the 
generated code performs exactly the same dynamic allocations and deallocations as the 
source program; therefore, heap memory usage is preserved. However, in Compcert, the 
sizes of stack blocks can increase arbitrarily between Cminor and PPC, owing to the 
spilling of Cminor variables to the stack described in section 12. A Cminor program 
that executes correctly within N bytes of stack space can therefore be translated to a 
PPC program that runs out of stack, significantly weakening the semantic preservation 
theorems that we proved. 

In other words, while heap memory usage is clearly preserved, it seems difficult to
prove a bound on stack usage on the source program and expect this resource certi- 
fication to carry over to compiled code: stack consumption, like execution time, is a 
program property that is not naturally preserved by compilation. A simpler alternative 
is to establish the memory bound directly on the compiled code. If recursion and func- 
tion pointers are not used, which is generally the case for critical embedded software, 
a simple, verified static analysis over Mach code that approximates the call graph can 
provide the required bound N on stack usage. We could, then, prove a strong semantic 
preservation theorem of the form "if the PPC stack is of size at least N, the generated 
PPC code behaves like the source Cminor program". For programs that use recursion 
or function pointers, the issue remains open, however. 
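
As a rough illustration of the kind of analysis meant here (a Python sketch, not the verified analysis over Mach code), the bound N can be computed as the heaviest path through an acyclic call graph. The frame sizes, function names and the stack_bound helper below are hypothetical; the sketch assumes that every call target is known statically.

```python
from typing import Dict, List, Set


def stack_bound(frame_size: Dict[str, int],
                calls: Dict[str, List[str]],
                entry: str) -> int:
    """Worst-case stack usage (bytes) when execution starts at `entry`.

    Assumes no function pointers, so `calls` lists every possible callee.
    Recursion is rejected, since it admits no static bound of this form.
    """
    memo: Dict[str, int] = {}
    on_path: Set[str] = set()

    def visit(f: str) -> int:
        if f in memo:
            return memo[f]
        if f in on_path:
            raise ValueError(f"recursion through {f}: no static bound")
        on_path.add(f)
        # The deepest callee determines how far the stack grows below f.
        deepest = max((visit(g) for g in calls.get(f, [])), default=0)
        on_path.remove(f)
        memo[f] = frame_size[f] + deepest
        return memo[f]

    return visit(entry)


# Hypothetical frame sizes (in bytes) and call graph.
FRAMES = {"main": 32, "filter": 64, "log": 16, "sqrt": 48}
CALLS = {"main": ["filter", "log"], "filter": ["sqrt", "log"]}
N = stack_bound(FRAMES, CALLS, "main")
```

Here the heaviest path is main, filter, sqrt. A verified version would additionally have to prove that the frame sizes it reads are those actually allocated by the generated code.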

17.5 On the multiplicity of passes and intermediate languages 

The number of compilation passes in Compcert is relatively high by compiler standards, 
but not shockingly so. The main motivation here was to have passes that do exactly one 
thing each, but do it well and in a complete manner. Combining several passes together 
tends to complicate their proofs super-linearly. For example, the first published version
of Compcert [57] performed register allocation, reloading, spilling and enforcement of 
calling conventions all in one pass. Splitting this pass in two (register allocation in 
section 8, spilling in section 11) resulted in a net simplification of the proofs. 

What is more surprising is the high number of intermediate languages involved in 
Compcert: with the exception of dataflow-based optimizations (constant propagation, 
CSE, and tunneling), each pass introduces a new intermediate language to use as its 
target language. Many of these intermediate languages are small variations on one 
another. Yet we found it necessary to define each intermediate language separately, 
rather than identifying them with subsets of a small number of more general interme- 
diate languages (like for instance GCC does with its Tree and RTL representations). 

The general problem we face is that of transmitting information between compiler 
passes: what are the properties of its output that a compiler pass guarantees and that 
later passes can rely on? These guarantees can be positive properties (e.g. "the Mach 
code generated by the stack layout pass is compatible with treating registers as global 
variables") but also negative, "don't care" properties (e.g. "the LTL code generated by 
register allocation does not use temporary registers and is insensitive to modifications 
of caller-save registers across function calls"). 

Some of these guarantees can be captured syntactically. For instance, code produced 
by register allocation never mentions temporary registers. In Compcert, such syntactic 
guarantees for a compiler pass are enforced either by the abstract syntax of its target 
language or via an additional inductive predicate on this abstract syntax. As mentioned 
in section 4.3, all intermediate languages from Cminor to Mach are weakly typed in an
int-or-float type system. The corresponding typing rules are a good place to carry
additional restrictions on the syntax of intermediate languages. For example, LTL's 
typing rules enforce the restriction that all locations used are either non-temporary 
registers or stack slots of the local kind. 
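
To make the idea of a syntactic guarantee concrete, the following Python sketch checks a restriction of this shape over a toy instruction list. It is only an illustration: the Reg, Slot and Instr encodings, the register names in TEMPORARIES and the slot kinds are assumptions, not the Coq definitions used in Compcert.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple, Union

# Hypothetical names for the registers reserved as temporaries.
TEMPORARIES = frozenset({"t1", "t2"})


@dataclass(frozen=True)
class Reg:
    name: str


@dataclass(frozen=True)
class Slot:
    kind: str      # assumed kinds: "local", "incoming", "outgoing"
    offset: int


Loc = Union[Reg, Slot]


@dataclass(frozen=True)
class Instr:
    op: str
    args: Tuple[Loc, ...]
    dest: Loc


def loc_acceptable(l: Loc) -> bool:
    """A location is acceptable if it is a non-temporary register
    or a stack slot of the local kind."""
    if isinstance(l, Reg):
        return l.name not in TEMPORARIES
    return l.kind == "local"


def wt_instr(i: Instr) -> bool:
    """Every location read or written by the instruction is acceptable."""
    return all(loc_acceptable(l) for l in (*i.args, i.dest))


def wt_code(code: Iterable[Instr]) -> bool:
    return all(wt_instr(i) for i in code)


GOOD = [Instr("add", (Reg("r3"), Slot("local", 0)), Reg("r4"))]
BAD = [Instr("move", (Reg("t1"),), Reg("r3"))]
```

In a proof assistant the same check would be stated as a predicate on the abstract syntax, and each later pass could take it as a hypothesis about its input.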

Other guarantees are semantic in nature and cannot be expressed by syntactic 
restriction. One example is the fact that LTL code generated by register allocation does 
not expect caller-save registers to be preserved across function calls. Such properties 
need to be reflected in the dynamic semantics for the target language of the considered 
pass. Continuing the previous example, the LTL semantics captures the property by
explicitly setting caller-save registers to the undef value when executing a function
call (section 8.1). Later passes can then refine these undef values to whatever value is 
convenient for them (section 11.3). 
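
The following Python sketch illustrates this "don't care" encoding on a toy register file. The register names, the after_call function and the refines relation are hypothetical simplifications; they only mimic the shape of the idea, under the assumption that a state is a finite map from register names to values.

```python
UNDEF = object()  # stands for the undef value

# Hypothetical split of the register file.
CALLER_SAVE = ("r3", "r4", "r5")
CALLEE_SAVE = ("r13", "r14")


def after_call(regs: dict, result_reg: str, result) -> dict:
    """State after a call in the 'don't care' semantics: caller-save
    registers become undef, then the result register receives the result.
    Callee-save registers keep their values."""
    out = dict(regs)
    for r in CALLER_SAVE:
        out[r] = UNDEF
    out[result_reg] = result
    return out


def lessdef(v1, v2) -> bool:
    """v1 is undef, or v1 and v2 are the same value."""
    return v1 is UNDEF or v1 == v2


def refines(abstract: dict, concrete: dict) -> bool:
    """Every register of the concrete state refines the abstract one."""
    return all(lessdef(v, concrete[r]) for r, v in abstract.items())


before = {"r3": 1, "r4": 2, "r5": 3, "r13": 7, "r14": 8}
abstract = after_call(before, "r3", 42)

# A later pass may let the callee leave arbitrary values in r4 and r5 ...
clobbered = {"r3": 42, "r4": 99, "r5": -1, "r13": 7, "r14": 8}
# ... but not change a callee-save register.
broken = {"r3": 42, "r4": 99, "r5": -1, "r13": 0, "r14": 8}
```

With these definitions, refines(abstract, clobbered) holds while refines(abstract, broken) does not, which is the asymmetry the undef value is meant to express.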

In other cases, we must guarantee that an intermediate representation not only 
doesn't care about the values of some locations but actually preserves whatever values 
they hold. Just setting these locations to undef is not sufficient to capture this guar- 
antee. The approach we followed is to anticipate, in the semantics of an intermediate 
language, the actual values that these locations will take after later transformations. 
For example, the Linear semantics (section 11.1) anticipates the saving and reloading 
of callee-save registers performed by Mach code generated by the stack layout pass 
(section 12). Another instance of this technique is the semantics of Mach that antici- 
pates the return address that will be stored by PPC code in the slot of the activation 
record reserved for this usage (sections 12.1 and 14.2). While this technique of semantic 
anticipation was effective in Compcert, it is clearly not as modular as one would like: 
the semantics of an intermediate language becomes uncomfortably dependent on the 
effect of later compilation passes. Finding better techniques to capture behaviors of 
the form "this generated code does not depend on the value of X and guarantees to 
preserve the value of X" is an open problem. 

17.6 Toward machine language and beyond 

The Compcert compilation chain currently stops at the level of an assembly language 
following a Harvard architecture and equipped with a C-like memory model. The next 
step "down" would be machine language following a von Neumann architecture and 
representing memory as a finite array of bytes. The main issue in such a refinement is 
to bound the amount of memory needed for the call stack, as discussed in section 17.4. 
Once this is done, we believe that the refinement of the memory model can be proved 
correct as an instance of the memory embeddings studied by Leroy and Blazy [59, 
section 5.1]. Additionally, symbolic labels must be resolved into absolute addresses and 
relative displacements, and instructions must be encoded and stored in memory. Based 
on earlier work such as [70], such a refinement is likely to be tedious but should not 
raise major difficulties. 
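
As an indication of what the label-resolution step involves, here is a small two-pass sketch in Python. The tuple encoding of instructions and the resolve_labels helper are hypothetical; the sketch assumes fixed-size 4-byte instructions and branch displacements measured from the address of the branch instruction itself, and it leaves out instruction encoding entirely.

```python
INSTR_SIZE = 4  # bytes per instruction (assumed fixed)


def resolve_labels(program, base=0):
    """Replace symbolic branch targets by relative displacements.

    `program` is a list of tuples: ("label", name) marks a position,
    ("b", name) is an unconditional branch, anything else is kept as is.
    """
    # Pass 1: record the address of each label.
    addr = {}
    pc = base
    for item in program:
        if item[0] == "label":
            addr[item[1]] = pc
        else:
            pc += INSTR_SIZE

    # Pass 2: drop the labels and rewrite the branches.
    out = []
    pc = base
    for item in program:
        if item[0] == "label":
            continue
        if item[0] == "b":
            out.append(("b", addr[item[1]] - pc))
        else:
            out.append(item)
        pc += INSTR_SIZE
    return out


PROG = [
    ("b", "end"),
    ("label", "loop"),
    ("addi", "r3", "r3", 1),
    ("b", "loop"),
    ("label", "end"),
    ("blr",),
]
RESOLVED = resolve_labels(PROG)
```

The correctness statement for such a pass would relate each symbolic branch in the source to the instruction found at the computed address in the output.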

The main interest in going all the way to machine language is to connect our 
work with existing and future hardware verification efforts at the level of instruction 
set architectures (i.e. machine language) and micro-architectures. Examples of such 
hardware verifications include the Piton project [69,70] (from a high-level assembly
language to an NDL netlist for a custom microprocessor), Fox's verification of the
ARM6 micro-architecture [34], and the VAMP project [14] (from the DLX instruction
set to a gate-level description of a processor). Sharing a specification of an instruction
set architecture between the verification of a compiler and the verification of a hardware 
architecture strengthens the confidence we can have in both verifications. 

17.7 Toward shared-memory concurrency 

Shared-memory concurrency is back in fashion these days and is infamous for raising
serious difficulties both with the verification of concurrent source programs and with 
the reuse, in a concurrent setting, of languages and compilers designed for sequential
execution [18]. An obvious question, therefore, is whether the Compcert back-end and 
its soundness proof could be extended to account for shared-memory concurrency. 

It is relatively easy to give naive interleaving semantics for concurrency in Cmi-
nor and the other languages of Compcert, but semantic preservation during compila-
tion obviously fails: if arbitrary data races are allowed in the source Cminor program,
the transformations performed by the compiler introduce additional interleavings and 
therefore additional behaviors not present in the source program. For instance, the eval- 
uation of an expression is an atomic step in the Cminor semantics but gets decomposed 
into several instructions during compilation. The weakly consistent hardware memory 
models implemented by today's processors add even more behaviors that cannot be 
predicted easily by the Cminor semantics. 
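
The effect can be seen on a toy example. The Python sketch below enumerates all interleavings of two threads that each increment a shared variable, first with the increment as one atomic step (as in the source semantics) and then decomposed into load, add and store steps (as after compilation). The step names and the run and outcomes helpers are hypothetical, and the sketch assumes a sequentially consistent memory, so it does not even account for weak hardware memory models.

```python
from itertools import combinations


def interleavings(a, b):
    """All merges of sequences a and b that preserve the order of each."""
    n, m = len(a), len(b)
    for pos in combinations(range(n + m), n):
        chosen = set(pos)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if k in chosen else next(ib) for k in range(n + m)]


def run(schedule):
    """Execute a schedule of (thread id, step) pairs on shared x = 0."""
    x = 0
    reg = {}                      # one private register per thread
    for tid, op in schedule:
        if op == "incr":          # source level: x = x + 1 in one step
            x += 1
        elif op == "load":
            reg[tid] = x
        elif op == "add":
            reg[tid] += 1
        elif op == "store":
            x = reg[tid]
    return x


def outcomes(thread_ops):
    """Final values of x over all interleavings of two identical threads."""
    t1 = [(1, op) for op in thread_ops]
    t2 = [(2, op) for op in thread_ops]
    return {run(s) for s in interleavings(t1, t2)}


SOURCE = outcomes(["incr"])
COMPILED = outcomes(["load", "add", "store"])
```

SOURCE contains only the value 2, whereas COMPILED also contains 1, the lost-update outcome, which is a behavior of the compiled code that the source program does not have.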

Our best hope to show a semantic preservation result in a shared-memory con- 
current setting is, therefore, to restrict ourselves to race-free source programs that 
implement a proper mutual exclusion discipline on their memory accesses. A power- 
ful way to characterize such programs is concurrent separation logic [79]. Using this 
approach, Hobor et al. [43] develop an operational semantics for Concurrent Cminor, 
an extension of Cminor with threads and locks. This semantics is pseudo-sequential in 
that threads run sequentially between two operations over locks, and their interleaving 
is determined by an external oracle that appears as a parameter to the semantics. It 
is conceivable that, for a fixed but arbitrary oracle, the Compcert proofs of semantic 
preservation would still hold. The guarantees offered by concurrent separation logic 
would then imply that the pseudo-sequential semantics for the generated PPC code 
captures all possible actual executions of this code, even in the presence of arbitrary 
interleavings and weakly-consistent hardware memory. This approach is very promis- 
ing, but much work remains to be done. 

18 Related work 

We have already discussed the relations between compiler verification and other ap- 
proaches to trusted compilation in section 2. Proving the correctness of compilers has 
been an active research topic for more than 40 years, starting with the seminal work 
of McCarthy and Painter [66]. Since then, a great many on-paper proofs for program 
analyses and compiler transformations have been published — too many to survey here. 
Representative examples include the works of Clemmensen and Oest [24], Chirica and 
Martin [22], Guttman et al. [39], Müller-Olm [74] and Lacey et al. [52]. We refer the
reader to Dave's annotated bibliography [28] for further references. In the following, we 
restrict ourselves to discussing correctness proofs of compilers that involve mechanized 
verification. 

Milner and Weyhrauch [68] were arguably the first to mechanically verify semantic
preservation for a compilation algorithm (the translation of arithmetic expressions to 
stack machine code), using the Stanford LCF prover. The first mechanized verification 
of a full compiler is probably that of Moore [69,70], although for a rather low-level, 
assembly-style source language. 

The Verifix project [36] had goals broadly similar to ours: the construction of math- 
ematically correct compilers. It produced several methodological approaches, but few 
parts of this project led to machine-checked proofs. The parts closest to the present 
work are the formal verification (in PVS) of a compiler from a subset of Common Lisp
to Transputer code [30] and of instruction selection by a bottom-up rewrite system
[29]. 

Strecker [91] and Klein and Nipkow [49] verified non-optimizing byte-code compilers
from a subset of Java to a subset of the Java Virtual Machine using Isabelle/HOL. They
did not address compiler optimizations nor generation of actual machine code. Another 
verification of a byte-code compiler is that of Grégoire [37], for a functional language.

The Verisoft project [80] is an ambitious attempt at end-to-end formal verification 
that covers the whole spectrum from application to hardware. The compiler part of 
Verisoft is the formal verification of a compiler for a Pascal-like language called C0
down to DLX machine code using the Isabelle/HOL proof assistant [53,92,54]. The 
publications on this verification lack details but suggest a compiler that is much simpler 
than Compcert and generates unoptimized code.

Li et al. [63,64] describe an original approach to the generation of trusted ARM code
where the input language is a subset of the HOL specification language. Compilation is 
not entirely automatic: the user chooses interactively which transformations to apply, 
but the system produces formal evidence (a HOL proof term) that the generated ARM 
code conforms to the HOL specification. 

Chlipala [23] developed and proved correct a compiler for simply-typed λ-calculus
down to an idealized assembly language. This Coq development cleverly uses depen- 
dent types and type-indexed denotational semantics, resulting in remarkably compact 
proofs. Another Coq verification of a compiler from a simply-typed functional language 
to an idealized assembly language is that of Benton and Hur [9]. Like Chlipala's, their 
proof has a strong denotational semantics flavor; it builds upon the concepts of step- 
indexed logical relations and biorthogonality. It is unclear yet whether such advanced 
semantic techniques can be profitably applied to low-level, untyped languages such as 
those considered in this paper, but this is an interesting question. 

The formal verification of static analyses, usable both to support compiler opti-
mizations and to establish safety properties of programs, has received much attention.
Considerable efforts have been expended on formally verifying Java's dataflow-based 
bytecode verification; see [41,56] for a survey. Cachera et al. [19] and Pichardie [82]
develop a framework for abstract interpretation and dataflow analysis, formally veri-
fied using Coq. A related project is Rhodium [55], a domain-specific language to de-
scribe program analyses and transformations. From a Rhodium specification, both
executable code and an automatically-verified proof of semantic preservation are gen-
erated. Rhodium achieves a high degree of automation, but applies only to the op-
timization phases of a compiler and not to the non-optimizing translations from one
language to another. 

19 Conclusions 

The formal verification of a compiler back-end presented in this article provides strong 
evidence that the initial goal of formally verifying a realistic compiler can be achieved, 
within the limitations of today's proof assistants, and using only elementary semantic 
and algorithmic approaches. It is, however, just one exploration within a wide research 
area: the certification, using formal methods, of the verification tools, code generators, 
compilers and run-time systems that participate in the development, validation and 
execution of critical software. In addition, we hope that this work also contributes to 
renewing scientific interest in the semantic understanding of compiler technology, in 
mechanized operational semantics, and in integrated environments for programming
and proving. 

Looking back at what was achieved, we did not completely rule out all uncertainties 
concerning the soundness of the compiler, but reduced the problem of trusting the 
whole compiler down to trusting (1) the formal semantics for the source (Cminor) and 
target (PPC) languages; (2) the compilation chain used to produce the executable for 
the compiler (Coq's extraction facility and the OCaml compiler); and (3) the Coq 
proof assistant itself. Concerning (3), it is true that an inconsistency in Coq's logic 
or a bug in Coq's implementation could theoretically invalidate all the guarantees we 
obtained about Compcert. As Hales [40] argues, this is extremely unlikely, and proofs 
mechanically checked by a proof assistant that generates proof terms are orders of 
magnitude more trustworthy than even carefully hand-checked mathematical proofs. 

To address concern (2), ongoing work within the Compcert project studies the 
feasibility of formally verifying Coq's extraction mechanism and a compiler from Mini- 
ML (the target language for this extraction) to Cminor. Composed with the Compcert 
back-end, these efforts could eventually result in a trusted execution path for programs 
written and verified in Coq, like Compcert itself, therefore increasing confidence further 
through a form of bootstrapping. 

The main source of uncertainty is concern (1): do the formal semantics of Cminor 
and PPC, along with the underlying memory model, capture the intended behaviors? 
One could argue that they are small enough (about 2500 lines of Coq) to permit manual 
review. Another effective way to increase confidence in these semantics is to use them 
in other formal verifications, such as the Clight to Cminor and Mini-ML to Cminor front-
ends developed within the Compcert project, and the axiomatic semantics for Cminor
of [3]. Future work in this direction could include connections with architectural-level 
hardware verification (as outlined in section 17.6) or with verifications of program 
provers, model checkers and static analyzers for C-like languages. Drawing and for- 
malizing such connections would not only strengthen even further the confidence we 
can have in each component involved, but also progress towards the availability of 
high-assurance environments for the development and validation of critical software.